Approach for removing outliers from two-dimensional data - python

I am writing a Python program for finding areas of interest on a page. The positions on the page of all values of interest are given to me, but some values (typically only one or two) are far away from the others and I'd like to remove these. The data set is not huge, less than 100 data points but I will need to do this many times.
I have a cartesian coordinate system on two axes (x and y) in the first quadrant, so only positive values.
My data points represent boxes drawn on this coordinate system, which I have stored as a set of two coordinate pairs in a tuple. A box can be drawn from just two coordinate pairs since all its sides are straight (axis-aligned). Example: (8, 2, 15, 10) would draw a box with corners (x, y) = (8, 2), (8, 10), (15, 10) and (15, 2).
I am trying to remove the outliers in this set but am having a hard time figuring out a good approach. I have thought about removing the outliers by finding the IQR and removing all points that fall outside these bounds:
below Q1 - 1.5 * IQR, or
above Q3 + 1.5 * IQR
The problem here is that I am having a hard time figuring out how to apply this, because the values are not just coordinates but areas, if you will. And since they overlap, they don't fit well into a histogram either.
First I thought I might add a point for each whole value that the box spans; the example box would in that case create 56 points. That seems like quite a bad solution to me. Does anyone have any alternative solutions?

There are mainly two approaches: either you fix the threshold value yourself, or you let machine learning infer it for you.
For machine learning, you can use Isolation Forest.
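For a rough idea of the ML route, here is a minimal sketch (not a drop-in solution) using scikit-learn's IsolationForest. It assumes each box is an (x1, y1, x2, y2) tuple and that the box centres are a reasonable feature; the variable names, example data, and contamination value are just placeholders:

import numpy as np
from sklearn.ensemble import IsolationForest

boxes = [(8, 2, 15, 10), (9, 3, 14, 11), (80, 90, 95, 99)]  # hypothetical data
centres = np.array([[(x1 + x2) / 2, (y1 + y2) / 2] for x1, y1, x2, y2 in boxes])

clf = IsolationForest(contamination=0.1, random_state=0)  # tune contamination
labels = clf.fit_predict(centres)   # 1 = inlier, -1 = outlier
kept = [box for box, label in zip(boxes, labels) if label == 1]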
If you don't want ML, then you have to fix the threshold yourself. For that you can use a norm: there is np.linalg.norm(p1 - p2), or if you want more control over the metric, there is cdist:
scipy.spatial.distance.cdist(p1, p2)
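For the fixed-threshold route, one way around the "boxes, not points" issue is to reduce each box to its centre, measure every centre's distance to the median centre with np.linalg.norm, and apply the question's Q3 + 1.5 * IQR rule to those distances. A sketch under those assumptions (the function name is made up; cdist would give you the full pairwise distance matrix instead, if you prefer):

import numpy as np

def remove_outlier_boxes(boxes):
    centres = np.array([[(x1 + x2) / 2, (y1 + y2) / 2] for x1, y1, x2, y2 in boxes])
    dists = np.linalg.norm(centres - np.median(centres, axis=0), axis=1)
    q1, q3 = np.percentile(dists, [25, 75])
    cutoff = q3 + 1.5 * (q3 - q1)   # IQR rule from the question, applied to distances
    return [box for box, d in zip(boxes, dists) if d <= cutoff]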


trouble with scipy interpolation

I'm having trouble using the scipy interpolation methods to generate a nice smooth curve from the data points given. I've tried using the standard 1D interpolation and the Rbf interpolation with all options (cubic, gaussian, multiquadric, etc.).
In the image provided, the blue line is the original data, and I'm looking to first smooth the sharp edges, and then have dynamically editable points from which to recalculate the curve. Each time a single point is edited it should automatically calculate a new spline of some sort to smoothly transition between each point.
It kind of works when the points are within a particular range of each other as below.
But if the points end up too far apart, or too close together, I end up with issues like the following.
Key points are:
The curve MUST be flat between the first two points
The curve must NOT go below point 1 or 2 (i.e. derivative can't be negative)
~15 points (not shown) between points 2 and 3 are also editable and the line between is not necessarily linear. Full control over each of these points is a must, as is the curve going through each of them.
I'm happy to break it down into smaller curves that I then join/convolve, but I just need to ensure a >0 gradient.
sample data:
x=[0, 37, 50, 105, 115,120]
y=[0.00965, 0.00965, 0.047850827205882, 0.35600416666667, 0.38074375, 0.38074375]
As an example, try moving point 2 (x=37) to an extreme value, say 10 (keep y the same). Just ensure that all points from x=0 to x=10 (or any other variation) have identical y values of 0.00965.
Any assistance is greatly appreciated.
UPDATE
Attempted pchip method suggested in comments with the results below:
(Image: pchip method, better and worse...)
Solved!
While I'm not sure that this is exactly true, it is as if the spline tools for creating Bezier curves treat the control points as points the calculated curve must go through - which is not true in my case. I couldn't figure out how to turn this behaviour off, so I found the cubic formula for a Bezier curve (cubic is what I need) and calculated my own points. I only then had to do a little adjustment to make the points fit the required integer x values - in my case, near enough is good enough. I would otherwise have needed to interpolate linearly between the two points on either side of the desired x value and determine the exact value.
For those interested, cubic needs 4 points - start, end, and 2 control points. The rule is:
B(t) = (1-t)^3 * P0 + 3*(1-t)^2 * t * P1 + 3*(1-t) * t^2 * P2 + t^3 * P3
Calculate for x and y separately, using a list of values for t. If you need to match gradients, just make sure that the control points P1 and P2 are only moved along the same gradient as the preceding/following sections.
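As a rough illustration of evaluating that formula with numpy (this is a sketch, not the poster's actual code; the function name and the example control points are invented, and you would pick P1 and P2 yourself to keep the curve flat where needed):

import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=100):
    # Evaluate B(t) for x and y separately; p0..p3 are (x, y) pairs.
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Start/end taken from the sample data; the two control points are made up here.
curve = cubic_bezier((37, 0.00965), (60, 0.00965), (85, 0.30), (105, 0.35600416666667))
xs, ys = curve[:, 0], curve[:, 1]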
(Image: perfect result.)

Interpolating a line between two other lines in python [duplicate]

I'm sorry for the somewhat confusing title, but I wasn't sure how to sum this up any clearer.
I have two sets of X,Y data, each set corresponding to a general overall value. They are fairly densely sampled from the raw data. What I'm looking for is a way to find an interpolated X for any given Y for a value in between the sets I already have.
The graph makes this more clear:
In this case, the red line is from a set corresponding to 100, the yellow line is from a set corresponding to 50.
I want to be able to say, assuming these sets correspond to a gradient of values (even though they are clearly made up of discrete X,Y measurements), how do I find, say, where the X would be if the Y was 500 for a set that corresponded to a value of 75?
In the example here I would expect my desired point to be somewhere around here:
I do not need this function to be overly fancy — it can be simple linear interpolation of data points. I'm just having trouble thinking it through.
Note that neither the Xs nor the Ys of the two sets overlap perfectly. However it is rather trivial to say, "where are the nearest X points these sets share," or "where are the nearest Y points these sets share."
I have used simple interpolation between known values (e.g. find the X for corresponding Ys for sets "50" and "100", then average those to get "75"), and I end up with something that looks like this:
So clearly I am doing something wrong here. Obviously, in this case X is (correctly) returning as 0 for all of those cases where the Y is higher than the maximum Y of the "lowest" set. Things start out fine, but somewhere around the point where one approaches the maximum Y of the lowest set, it starts going haywire.
It's easy to see why mine is going wrong. Here's another way to look at the problem:
In the "correct" version, X ought to be about 250. Instead, what I'm doing is essentially averaging 400 and 0 so X is 200. How do I solve for X in such a situation? I was thinking that bilinear interpolation might hold the answer but nothing I've been able to find on that has made it clear how I'd go about this sort of thing, because they all seem to be structured for somewhat different problems.
Thank you for your help. Note that while I have obviously graphed the above data in R to make it easy to see what I'm talking about, the final work for this is in Javascript and PHP. I'm not looking for something heavy duty; simple is better.
Good lord, I finally figured it out. Here's the end result:
Beautiful! But what a lot of work it was.
My code is too cobbled and too specific to my project to be of much use to anyone else. But here's the underlying logic.
You have to have two sets of data to interpolate from. I am calling these the "outer" curve and the "inner" curve. The "outer" curve is assumed to completely encompass, and not intersect with, the "inner" curve. The curves are really just sets of X,Y data, and correspond to a set of values defined as Z. In the example used here, the "outer" curve corresponds to Z = 50 and the "inner" curve corresponds to Z = 100.
The goal, just to reiterate, is to find X for any given Y where Z is some number in between our known points of data.
Start by figuring out the percentage between the two curve sets that the unknown Z represents. So if Z=75 in our example then that works out to be 0.5. If Z = 60 that would be 0.2. If Z = 90 then that would be 0.8. Call this proportion P.
Select the data point on the "outer" curve where Y = your desired Y. Imagine a line segment between that point and 0,0. Define that as AB.
We want to find where AB intersects with the "inner" curve. To do this, we iterate through each point on the inner curve. Define the line segment between the chosen point and the point+1 as CD. Check if AB and CD intersect. If not, continue iterating until they do.
When we find an AB-CD intersection, we now look at the line created by the intersection and our original point on the "outer" curve from step 2. This line segment, then, is a line between the inner and outer curve where the slope of the line, were it to be continued "down" the chart, would intersect with 0,0. Define this new line segment as EF.
Find the position at P percent (from step 1) of the length of EF. Check the Y value. Is it our desired Y value? If it is (unlikely), return the X of that point. If not, see if Y is less than the goal Y. If it is, store the position of that point in a variable, which I'll dub lowY. Then go back to step 2 again for the next point on the outer curve. If it is greater than the goal Y, see if lowY has a value in it. If it does, interpolate between the two values and return the interpolated X. (We have "boxed in" our desired coordinate, in other words.)
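To illustrate steps 2 through 5, here is a rough Python sketch (not the actual JavaScript/PHP; the function names and ordering assumptions are mine: both curves are lists of (x, y) tuples, the outer curve is ordered so that Y increases along it, and I read P from step 1 as being measured from the outer curve towards the inner one):

def seg_intersect(a, b, c, d):
    # Intersection point of segments ab and cd, or None if they don't cross.
    (ax, ay), (bx, by), (cx, cy), (dx, dy) = a, b, c, d
    rx, ry = bx - ax, by - ay
    sx, sy = dx - cx, dy - cy
    denom = rx * sy - ry * sx
    if denom == 0:
        return None   # parallel segments
    t = ((cx - ax) * sy - (cy - ay) * sx) / denom
    u = ((cx - ax) * ry - (cy - ay) * rx) / denom
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (ax + t * rx, ay + t * ry)
    return None

def x_for(outer, inner, z_outer, z_inner, z, y_goal):
    p = (z - z_outer) / (z_inner - z_outer)   # step 1: proportion P
    low = None                                # last point found below y_goal
    for a in outer:                           # step 2: point on the outer curve
        hit = None
        for c, d in zip(inner, inner[1:]):    # step 3: AB-CD intersection
            hit = seg_intersect((0.0, 0.0), a, c, d)
            if hit is not None:
                break
        if hit is None:
            continue
        # steps 4-5: point at P percent along EF (from the outer point towards the inner hit)
        x = a[0] + p * (hit[0] - a[0])
        y = a[1] + p * (hit[1] - a[1])
        if y == y_goal:
            return x
        if y < y_goal:
            low = (y, x)
        elif low is not None:                 # boxed in: interpolate and return
            return low[1] + (x - low[1]) * (y_goal - low[0]) / (y - low[0])
    return None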
The above procedure works pretty well. It fails in the case of Y=0, but that one is easy to handle, since you can just interpolate on those two specific points. In places where the number of samples is much smaller, it produces somewhat jaggy results, but I guess that's to be expected (these are Z = 5000, 6000, 7000, 8000, 9000, 10000, where only 5000 and 10000 are known points and they have only 20 data points each — the rest are interpolated):
I don't pretend that this is an optimized solution, but solving for gobs of points is practically instantaneous on my computer, so I assume it is not too taxing for a modern machine, at least with the number of total points I have (30-50 per curve).
Thanks for everyone's help; it helped a lot to talk this through a bit and realize that what I was really going for here was not any simple linear interpolation but a kind of "radial" interpolation along the curve.

How to correlate two time series with gaps and different time bases?

I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).
The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.
The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.
So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)
The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).
My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped axes, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).
The nature of the time series data makes Python's numpy.correlate() unusable. I've also looked at R's Zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.
Anyone have any clues for what I can do, or approaches I should research?
Update 28 Feb 2011: Added some plots here showing examples of the data.
My interpretation of your question: Given two very long, noisy time series, find a shift of one that matches large 'bumps' in one signal to large bumps in the other signal.
My suggestion: interpolate the data so it's uniformly spaced, rectify and smooth the data (assuming the phase of the fast oscillations is uninteresting), and do a one-point-at-a-time cross correlation (assuming a small shift will line up the data).
import numpy
from scipy.ndimage import gaussian_filter
"""
sig1 and sig2 are assumed to be large, 1D numpy arrays
sig1 is sampled at times t1, sig2 is sampled at times t2
t_start, t_end is your desired sampling interval
t_len is your desired number of measurements
"""
t = numpy.linspace(t_start, t_end, t_len)
sig1 = numpy.interp(t, t1, sig1)
sig2 = numpy.interp(t, t2, sig2)
#Now sig1 and sig2 are sampled at the same points.
"""
Rectify and smooth, so 'peaks' will stand out.
This makes big assumptions about your data;
these assumptions seem true-ish based on your plots.
"""
sigma = 10 #Tune this parameter to get the right smoothing
sig1, sig2 = abs(sig1), abs(sig2)
sig1, sig2 = gaussian_filter(sig1, sigma), gaussian_filter(sig2, sigma)
"""
Now sig1 and sig2 should look smoothly varying, with humps at each 'event'.
Hopefully we can search a small range of shifts to find the maximum of the
cross-correlation. This assumes your data are *nearly* lined up already.
"""
max_xc = 0
best_shift = 0
for shift in range(-10, 10): #Tune this search range
    xc = (numpy.roll(sig1, shift) * sig2).sum()
    if xc > max_xc:
        max_xc = xc
        best_shift = shift
print('Best shift:', best_shift)
"""
If best_shift is at the edges of your search range,
you should expand the search range.
"""
If the data contains gaps of unknown sizes that are different in each time series, then I would give up on trying to correlate entire sequences, and instead try cross correlating pairs of short windows on each time series, say overlapping windows twice the length of a typical event (300 samples long). Find potential high cross correlation matches across all possibilities, and then impose a sequential ordering constraint on the potential matches to get sequences of matched windows.
From there you have smaller problems that are easier to analyze.
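As a rough sketch of that windowed idea (assuming sig1 and sig2 are the resampled, rectified, smoothed numpy arrays from above; the function name, window/step sizes, and the brute-force pairing are mine, and for signals this long you would restrict the inner loop to a band of j values near i to keep it tractable):

import numpy as np

def best_window_matches(sig1, sig2, win=300, step=150, top=500):
    # Normalized cross-correlation of every window of sig1 against every
    # window of sig2; returns the 'top' highest-scoring (score, i, j) triples.
    def windows(sig):
        return [(i, sig[i:i + win]) for i in range(0, len(sig) - win + 1, step)]

    def unit(w):
        w = w - w.mean()
        n = np.linalg.norm(w)
        return w / n if n > 0 else w

    w1 = [(i, unit(w)) for i, w in windows(sig1)]
    w2 = [(j, unit(w)) for j, w in windows(sig2)]
    matches = [(float(np.dot(a, b)), i, j) for i, a in w1 for j, b in w2]
    matches.sort(reverse=True)
    return matches[:top]

The sequential-ordering constraint then amounts to keeping only a subset of the returned (i, j) pairs in which j increases with i.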
This isn't a technical answer, but it might help you come up with one:
Convert the plot to an image, and stick it into a decent image program like gimp or photoshop
break the plots into discrete images whenever there's a gap
put the first series of plots in a horizontal line
put the second series in a horizontal line right underneath it
visually identify the first correlated event
if the two events are not lined up vertically:
select whichever instance is further to the left and everything to the right of it on that row
drag those things to the right until they line up
This is pretty much how an audio editor works, so if you converted it into a simple audio format like an uncompressed WAV file, you could manipulate it directly in something like Audacity. (It'll sound horrible, of course, but you'll be able to move the data plots around pretty easily.)
Actually, Audacity has a scripting language called Nyquist, too, so if you don't need the program to detect the correlations (or you're at least willing to defer that step for the time being) you could probably use some combination of Audacity's markers and Nyquist to automate the alignment and export the clean data in your format of choice once you tag the correlation points.
My guess is, you'll have to manually build an offset table that aligns the "matches" between the series. Below is an example of a way to get those matches. The idea is to shift the data left-right until it lines up and then adjust the scale until it "matches". Give it a try.
library(rpanel)
#Generate the x1 and x2 data
n1 <- rnorm(500)
n2 <- rnorm(200)
x1 <- c(n1, rep(0,100), n2, rep(0,150))
x2 <- c(rep(0,50), 2*n1, rep(0,150), 3*n2, rep(0,50))
#Build the panel function that will draw/update the graph
lvm.draw <- function(panel) {
  plot(x=(1:length(panel$dat3))+panel$off, y=panel$dat3, ylim=panel$dat1, xlab="", ylab="y",
       main=paste("Alignment Graph Offset = ", panel$off, " Scale = ", panel$sca, sep=""), typ="l")
  lines(x=1:length(panel$dat3), y=panel$sca*panel$dat4, col="red")
  grid()
  panel
}
#Build the panel
xlimdat <- c(1, length(x1))
ylimdat <- c(-5, 5)
panel <- rp.control(title = "Eye-Ball-It", dat1=ylimdat, dat2=xlimdat, dat3=x1, dat4=x2, off=100, sca=1.0, size=c(300, 160))
rp.slider(panel, var=off, from=-500, to=500, action=lvm.draw, title="Offset", pos=c(5, 5, 290, 70), showvalue=TRUE)
rp.slider(panel, var=sca, from=0, to=2, action=lvm.draw, title="Scale", pos=c(5, 70, 290, 90), showvalue=TRUE)
It sounds like you want to minimize the function (Ax' + By) + (Az' + Bx) + (Ay' + Bz) over a pair of values, namely a time offset t0 and a time scale factor tr, where Ax' = tr*(Ax + t0), etc.
I would look into SciPy's optimization functions for two variables. And I would use a mask, or temporarily zero the data (both Ax' and By, for example), over the "gaps" (assuming the gaps can be programmatically determined).
To make the process more efficient, start with a coarse sampling of A and B, but set the precision in fmin (or whatever optimizer you've selected) to something commensurate with your sampling. Then proceed with progressively finer-sampled windows of the full dataset until your windows are narrow and are no longer down-sampled.
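A minimal sketch of that optimization with scipy.optimize.fmin, just on the vertical axis (the variable names are placeholders: t_a/a_vert and t_b/b_vert are assumed to be the sample times and vertical-axis signals of A and B, with the gaps already masked or zeroed as suggested above):

import numpy as np
from scipy.optimize import fmin

def misalignment(params, t_a, a_vert, t_b, b_vert):
    t0, tr = params
    # Shift and scale A's time base, then resample A onto B's sample times.
    # (np.interp needs tr > 0 so the transformed times stay increasing.)
    a_on_b = np.interp(t_b, tr * (t_a + t0), a_vert)
    # +Ax maps to -By, so a well-aligned pair should sum to roughly zero.
    return np.sum((a_on_b + b_vert) ** 2)

t0, tr = fmin(misalignment, x0=[0.0, 1.0],
              args=(t_a, a_vert, t_b, b_vert), xtol=1e-4, ftol=1e-4)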
Edit - matching axes
Regarding the issue of trying to identify which axis is co-linear with a given axis, without knowing a thing about the characteristics of your data, I can point towards a similar question. Look into pHash or any of the other methods outlined in this post to help identify similar waveforms.

Finding all points common to two circles

In Python, how would one find all integer points common to two circles?
For example, imagine a Venn diagram-like intersection of two (equally sized) circles, with center-points (x1,y1) and (x2,y2) and radii r1=r2. Additionally, we already know the two points of intersection of the circles are (xi1,yi1) and (xi2,yi2).
How would one generate a list of all points (x,y) contained in both circles in an efficient manner? That is, it would be simple to draw a box containing the intersections and iterate through it, checking if a given point is within both circles, but is there a better way?
Keep in mind that there are four cases here.
Neither circle intersects, meaning the "common area" is empty.
One circle resides entirely within the other, meaning the "common area" is the smaller/interior circle. Also note that a degenerate case of this is when they are the same circle, which, given your criterion that the radii are equal, is the only way one can lie inside the other.
The two circles touch at one intersection point.
The "general" case where there are going to be two intersection points. From there, you have two arcs that define the enclosed area. In that case, the box-drawing method could work for now, I'm not sure there's a more efficient method for determining what is contained by the intersection. Do note, however, if you're just interested in the area, there is a formula for that.
You may also want to look into the various clipping algorithms used in graphics development. I have used clipping algorithms to solve a lot of problems similar to what you are asking here.
If the locations and radii of your circles can vary with a granularity less than your grid, then you'll be checking a bunch of points anyway.
You can minimize the number of points you check by defining the search area appropriately. It has a width equal to the distance between the points of intersection, and a height equal to
r1 + r2 - D
with D being the separation of the two centers. Note that this rectangle in general is not aligned with the X and Y axes. (This also gives you a test as to whether the two circles intersect!)
Actually, you'd only need to check half of these points. If the radii are the same, you'd only need to check a quarter of them. The symmetry of the problem helps you there.
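As a small sketch of the rectangle described above (using the centres, radii, and known intersection points from the question; the function name is mine):

from math import hypot

def search_rect(x1, y1, r1, x2, y2, r2, xi1, yi1, xi2, yi2):
    D = hypot(x2 - x1, y2 - y1)
    if D >= r1 + r2 or D <= abs(r1 - r2):
        return None                        # no two-point intersection
    width = hypot(xi2 - xi1, yi2 - yi1)    # distance between intersection points
    height = r1 + r2 - D                   # as given above
    return width, height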
You're almost there.
Iterating over the points in the box should be fairly good, but you can do better if for the second coordinate you iterate directly between the limits.
Say you iterate along the x axis first. Then, for the y axis, instead of iterating between the bounding-box coordinates, figure out where each circle intersects that vertical line; more specifically, you are interested in the y coordinates of the intersection points, and you iterate between those (pay attention to rounding).
When you do this, because you already know you are inside the circles you can skip the checks entirely.
If you have a lot of points then you skip a lot of checks and you might get some performance improvements.
As an additional improvement you can pick the x axis or the y axis to minimize the number of times you need to compute intersection points.
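A sketch of that column-by-column idea (the function name is mine; math.ceil/floor handle the rounding concern):

from math import ceil, floor, sqrt

def lattice_points_in_both(x1, y1, r1, x2, y2, r2):
    # Iterate x across the overlap, then iterate y directly between the limits
    # derived from the two circles, so no per-point membership check is needed.
    pts = []
    for x in range(ceil(max(x1 - r1, x2 - r2)), floor(min(x1 + r1, x2 + r2)) + 1):
        d1 = r1 * r1 - (x - x1) ** 2
        d2 = r2 * r2 - (x - x2) ** 2
        if d1 < 0 or d2 < 0:
            continue                      # this column misses one of the circles
        lo = ceil(max(y1 - sqrt(d1), y2 - sqrt(d2)))
        hi = floor(min(y1 + sqrt(d1), y2 + sqrt(d2)))
        pts.extend((x, y) for y in range(lo, hi + 1))
    return pts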
So you want to find the lattice points that are inside both circles?
The method you suggested of drawing a box and iterating through all the points in the box seems the simplest to me. It will probably be efficient, as long as the number of points in the box is comparable to the number of points in the intersection.
And even if it isn't as efficient as possible, you shouldn't try to optimize it until you have a good reason to believe it's a real bottleneck.
I assume by "all points" you mean "all pixels". Suppose your display is NX by NY pixels. Have two arrays
int x0[NY], x1[NY]; initially full of -1.
The intersection is lozenge-shaped, between two curves.
Iterate x,y values along each curve. At each y value (that is, where the curve crosses y + 0.5), store the x value in the array. If x0[y] is -1, store it in x0, else store it in x1.
Also keep track of the lowest and highest values of y.
When you are done, just iterate over the y values, and at each y, iterate over the x values between x0 and x1, that is, for (ix = x0[iy]; ix < x1[iy]; ix++) (or the reverse).
It's important to understand that pixels are not the points where x and y are integers. Rather pixels are the little squares between the grid lines. This will prevent you from having edge-case problems.

Estimating the boundary of arbitrarily distributed data

I have two dimensional discrete spatial data. I would like to make an approximation of the spatial boundaries of this data so that I can produce a plot with another dataset on top of it.
Ideally, this would be an ordered set of (x,y) points that matplotlib can plot with the plt.Polygon() patch.
My initial attempt is very inelegant: I place a fine grid over the data, and where data is found in a cell, a square matplotlib patch is created of that cell. The resolution of the boundary thus depends on the sampling frequency of the grid. Here is an example, where the grey region are the cells containing data, black where no data exists.
1st attempt http://astro.dur.ac.uk/~dmurphy/data_limits.png
OK, problem solved - why am I still here? Well.... I'd like a more "elegant" solution, or at least one that is faster (ie. I don't want to get on with "real" work, I'd like to have some fun with this!). The best way I can think of is a ray-tracing approach - eg:
from xmin to xmax, at y=ymin, check if data boundary crossed in intervals dx
y=ymin+dy, do 1
do 1-2, but now sample in y
An alternative is defining a centre, and sampling in r-theta space - ie radial spokes in dtheta increments.
Both would produce a set of (x,y) points, but then how do I order/link neighbouring points to create the boundary?
A nearest neighbour approach is not appropriate as, for example (to borrow from Geography), an isthmus (think of Panama connecting N&S America) could then close off and isolate regions. This also might not deal very well with the holes seen in the data, which I would like to represent as a different plt.Polygon.
The solution perhaps comes from solving an area maximisation problem. For a set of points defining the data limits, what is the maximum contiguous area contained within those points? To form the enclosed area, what are the neighbouring points for the nth point? How will the holes be treated in this scheme - is this erring into topology now?
Apologies, much of this is me thinking out loud. I'd be grateful for some hints, suggestions or solutions. I suspect this is an oft-studied problem with many solution techniques, but I'm looking for something simple to code and quick to run... I guess everyone is, really!
~~~~~~~~~~~~~~~~~~~~~~~~~
OK, here's attempt #2 using Mark's idea of convex hulls:
attempt #2: http://astro.dur.ac.uk/~dmurphy/data_limitsv2.png
For this I used qconvex from the qhull package, getting it to return the extreme vertices. For those interested:
cat [data] | qconvex Fx > out
The sampling of the perimeter seems quite low, and although I haven't played much with the settings, I'm not convinced I can improve the fidelity.
I think what you are looking for is the convex hull of the data. That will give a set of points that, if connected, will mean that all your points are on or inside the connected points.
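For anyone who wants this in Python rather than via qconvex, here is a minimal sketch using scipy.spatial.ConvexHull and a matplotlib Polygon patch (the points array is just a stand-in for the real data):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from scipy.spatial import ConvexHull

points = np.random.rand(200, 2)          # stand-in for the real (x, y) data
hull = ConvexHull(points)
boundary = points[hull.vertices]         # hull vertices, already in order

fig, ax = plt.subplots()
ax.add_patch(Polygon(boundary, closed=True, fill=False, edgecolor="grey"))
ax.plot(points[:, 0], points[:, 1], "k.", markersize=2)
ax.autoscale_view()
plt.show()

Note that, as discussed above, a convex hull cannot represent concave sections or holes in the data.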
I may have missed something, but what's the motivation for not simply determining the maximum and minimum x and y levels? Unless you have an enormous amount of data, you could simply iterate through your points determining the minimum and maximum levels fairly quickly.
This isn't the most efficient example, but if your data set is small this won't be particularly slow:
import random
data = [(random.randint(-100, 100), random.randint(-100, 100)) for i in range(1000)]
x_min = min([point[0] for point in data])
x_max = max([point[0] for point in data])
y_min = min([point[1] for point in data])
y_max = max([point[1] for point in data])
