How to detect and eliminate outliers from a changing dataset - python

I have a dataset which contains the pixel location of a certain object in each frame. My code detects the object accurately most of the time; yet, there are false negatives.
I plotted the first 600 values (x-axis: frame number, y-axis: pixel location of the object). The first image shows the raw data; the second shows the correct path.
I have already tried mean and median filtering with different parameters, but couldn't get anything useful. Is there any way/algorithm to replace the outliers with correct values?

RANSAC is a technique for ignoring outliers and selecting only inliers for a computation.
Since this case does not have a single mathematical function to fit to the data, you cannot apply RANSAC directly.
But, as a workaround, looking at the data graph, you can try to fit a line to every 20 frames of data and remove the outliers in each interval. This should help reduce the overall effect of the outliers.
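To make the windowed-fit idea concrete, here is a minimal numpy sketch (not RANSAC proper): it fits a least-squares line to each 20-frame window, flags frames whose residual is large relative to the window's median residual, and replaces them with the fitted value. The array name `y`, the window length, and the threshold are assumptions you would tune to your data:

```python
import numpy as np

def clean_track(y, window=20, thresh=3.0):
    """Fit a line per window; replace points far from the fit.

    y: 1-D array of pixel locations per frame.
    thresh: outlier cutoff in multiples of the median absolute residual.
    """
    y = np.asarray(y, dtype=float)
    cleaned = y.copy()
    for start in range(0, len(y), window):
        seg = y[start:start + window]
        if len(seg) < 3:
            continue  # too few points to fit a line
        x = np.arange(len(seg))
        # least-squares line through this window
        slope, intercept = np.polyfit(x, seg, 1)
        fit = slope * x + intercept
        resid = np.abs(seg - fit)
        scale = np.median(resid) + 1e-9
        outliers = resid > thresh * scale
        # replace flagged frames with the fitted value
        cleaned[start:start + window][outliers] = fit[outliers]
    return cleaned
```

A single spike still drags the window's fit somewhat; iterating (refit after removing flagged points) would tighten the replacement values.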


how to find the exact data value on click event or mouse hover in a time series graph drawn by datashader and holoview

Here I am trying to visualize 1 billion data points.
The scatter plot below represents a graph of time-value pairs.
e.g.:
df:
TIME,VAL
145000000,1.464000
150000000,1.466000
155000000,1.461250
160000000,1.481750
165000000,1.493500
170000000,1.514500
175000000,1.524000
180000000,1.543750
185000000,1.553750
190000000,1.582000
195000000,1.594000
200000000,1.625000
205000000,1.639500
210000000,1.679250
215000000,1.697250
220000000,1.720000
I need to find the exact time-value pair that is mapped to an (x, y) point.
Is there any way to find the real time-value pair for a particular (x, y) click on the raster image rendered on screen?
If you're looking to find the individual row contributing to a specific point, the answer is that you can't.
Unlike matplotlib, datashader is not rendering individual points. Instead, it first defines the image boundaries, then using the requested number of pixels, computes the range of values in (x, y) which fall into each pixel. It then bins/discretizes your data, so the rendering engine is only working with summary statistics for each pixel - not the individual values from your source data. This is what makes datashader so powerful when rendering huge datasets, but it also means that nowhere is there a mapping from rows to pixels.
You could of course identify the boundaries of a given pixel and then filter your dataset to pull all rows with data falling into these bounds. But there's no guarantee that the match will be unique (this depends on your data).
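The filtering approach just described (identify a pixel's bounds, then pull all rows that fall inside them) might look like this minimal pandas sketch; the canvas ranges and pixel dimensions are assumptions you would read off your datashader `Canvas` settings, and the column names follow the example above:

```python
import pandas as pd

def rows_in_pixel(df, px, py, x_range, y_range, plot_width, plot_height):
    """Return the rows of df whose (TIME, VAL) fall inside pixel (px, py).

    x_range / y_range: data ranges covered by the datashader canvas.
    plot_width / plot_height: canvas size in pixels.
    """
    x_bin = (x_range[1] - x_range[0]) / plot_width   # data units per pixel, x
    y_bin = (y_range[1] - y_range[0]) / plot_height  # data units per pixel, y
    x0 = x_range[0] + px * x_bin                     # pixel's lower x bound
    y0 = y_range[0] + py * y_bin                     # pixel's lower y bound
    mask = (
        (df["TIME"] >= x0) & (df["TIME"] < x0 + x_bin)
        & (df["VAL"] >= y0) & (df["VAL"] < y0 + y_bin)
    )
    return df[mask]
```

As noted above, the returned match need not be unique: any number of source rows can land in the same pixel.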
Michael Delgado is correct that the rendered image doesn't contain information about the rows, but (a) you can use the HoloViews "inspect" tools to look up the original datapoints mapped into that pixel (automating the process he describes; see https://examples.pyviz.org/ship_traffic), and (b) it's on our list to provide such an inverse mapping from pixel to rows, with the constraint that only n datapoints can be returned for each pixel (see proposal in https://github.com/holoviz/datashader/issues/1126). Once we have such a mapping, it should be trivial to provide hover and click information in holoviews for datashader plots without the cost of searching through the original dataset. Wish us luck, and in the meantime, use inspect_points!

How to alter a dataset to match another similar -warped- one by using the existing intersection between them?

I have two coordinate systems for each record in my dataset: lat-lon coordinates and what I assume are UTM x-y coordinates.
50% of my dataset has only x-y data without lat-lon; vice versa, it's 6%.
A good portion of the dataset (33%) has both for each record.
I wanted to know if there is a way to take advantage of the intersection (and maybe the x-y-only part, since it's the biggest) to obtain a full dataset with a single coordinate system that makes sense. The problem is that after a little preprocessing, they look "relaxed" in different ways and the intersection doesn't really match. The scatter plot shows what I believe to be a nonlinear, warped relationship between the two systems. By this I mean that normalizing both to [0, 1] and centering them at (0, 0) (by subtracting the mean) gives two slightly different point distributions, and multiplying by a coefficient to scale one onto the other is not enough to make them match completely. It looks like some more complicated relationship between the two.
I also tried the external utm library to convert the lat-lon coordinates to x-y, giving a third pair of attributes (let's call it my_xy), only to find that it matches neither of the first two systems; instead it shows yet another slight warp.
Note: when I say I do not have data from one coordinate system, assume NaN.
Furthermore, I know the warping could result from the fundamental geometric differences between lat-lon and x-y systems, but I still do not know what else to try, given that the utm conversion and the scaling did not work.
Blue: latlon, Red: original xy, Green: my_xy calculated from latlon
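One way to exploit the 33% intersection, as the question suggests, is to fit a mapping from lat-lon to x-y on the records that have both systems, then apply it to fill the lat-lon-only records. Below is a minimal least-squares sketch with a quadratic basis, which can absorb some of the warp a pure affine fit would miss; the function and array names are hypothetical, and a real solution may need a proper projection library instead:

```python
import numpy as np

def fit_latlon_to_xy(lat, lon, x, y):
    """Least-squares polynomial map (lat, lon) -> (x, y).

    Fitted on records that have both coordinate systems; the
    quadratic terms allow a mildly nonlinear (warped) relation.
    """
    def basis(lat, lon):
        return np.column_stack([
            np.ones_like(lat), lat, lon,   # affine part
            lat * lon, lat**2, lon**2,     # quadratic part
        ])

    A = basis(lat, lon)
    coef_x, *_ = np.linalg.lstsq(A, x, rcond=None)
    coef_y, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(lat, lon):
        B = basis(lat, lon)
        return B @ coef_x, B @ coef_y

    return predict
```

Checking the fit's residuals on held-out intersection records would tell you whether the quadratic basis is flexible enough for the warp you see.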

Dealing with empty regions of data in a histogram

I have 2 arrays of shape (1, N) containing right ascension and declination. The scatter plot of the data is below.
As you can see in the top left, no data was collected in this region, so it is empty.
I would like to form the histogram of this data as a way of investigating clustering. That empty spot (and many others like it), it seems to me, will cause a problem: when numpy.histogram2d draws a grid over the data and counts the points in each cell, the cells that fall on the empty region get a histogram value of zero. This pulls down the mean of the whole histogram. Having a sensible mean is important because I identify clusters by their standard deviation from the mean.
The problem, of course, is not that those cells are empty, but that there was no data to count there; ideally those cells should be ignored.
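A minimal numpy sketch of the "ignore those cells" idea: mask zero-count cells with a masked array so they do not enter the mean or standard deviation. Note that this treats every empty cell as "uncovered", which is only a proxy; if the survey footprint is known, building the mask from that instead would avoid hiding genuinely empty sky:

```python
import numpy as np

def clipped_stats(ra, dec, bins=50):
    """2-D histogram whose mean/std ignore empty (uncovered) cells."""
    H, xedges, yedges = np.histogram2d(ra, dec, bins=bins)
    # Treat zero-count cells as "no coverage" and mask them out;
    # masked entries are skipped by .mean() and .std().
    Hm = np.ma.masked_equal(H, 0)
    return Hm, Hm.mean(), Hm.std()
```

Clusters can then be flagged as cells more than k standard deviations above the masked mean.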

Separating Meshes with vtkPolyDataConnectivityFilter

I am having a hard time using VTK, especially the vtkPolyDataConnectivityFilter. I feed in the output of a marching cubes algorithm that created a surface from a 3-D point cloud.
However, when I try to set
filt = vtk.vtkConnectivityFilter()
filt.SetInputData(surface_data) # get the data from the MC alg.
filt.SetExtractionModeToLargestRegion()
filt.ColorRegionsOn()
filt.Update()
filt.GetNumberOfExtractedRegions()  # returns 53 instead of 1
it gives me weird results. I cannot use the extraction modes for specific regions or seed a single point, since I don't know them in advance.
I need to separate the points of the largest mesh from the smaller ones and keep only the large mesh.
When I render the whole output it shows me the correctly extracted region. However, the other regions are still contained in the dataset and there is no way to separate them.
What am I doing wrong?
Best, J
I had the same problem when I had to segment an STL file converted to vtkPolyData.
If you look at the example https://www.vtk.org/Wiki/VTK/Examples/Cxx/PolyData/PolyDataConnectivityFilter_SpecifiedRegion , you will find they use the member function SetExtractionModeToSpecifiedRegions().
Replace your code with the following:
filt.SetInputData(surface_data)
filt.SetExtractionModeToSpecifiedRegions()
filt.AddSpecifiedRegion(0)  # manually increment from 0 up to filt.GetNumberOfExtractedRegions()
filt.Update()
You will need to render and view each specified region to figure out the index of the segmented region you're actually interested in.

Higher sampling for image's projection

My software should judge spectrum bands and, given the locations of the bands, find the peak point and width of each band.
I learned to take the projection of the image and to find the width of each peak.
But I need a better way to find the projection.
The method I used reduces a 1600-pixel-wide image (e.g. 1600x40) to a 1600-point sequence. Ideally I would want to reduce the image to a 10000-point sequence using the same image.
I want a longer sequence because 1600 points provide too low a resolution: a single point makes a large difference to the measure (there is a 4% difference if a band is judged at 18 versus 19).
How do I get a longer projection from the same image?
Code I used: https://stackoverflow.com/a/9771560/604511
from PIL import Image
import numpy as np
from scipy.optimize import leastsq
# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("band2.png"))
# Average the pixel values along the colour axis
pic_avg = pic.mean(axis=2)
# Sum along the vertical axis to get the projection
projection = pic_avg.sum(axis=0)
# Normalise and set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
What you want to do is called interpolation. Scipy has an interpolate module with a whole bunch of different functions for differing situations, including routines for images.
Here is a recently asked question that has some example code, and a graph that shows what happens.
But it is really important to realise that interpolating will not make your data more accurate, so it will not help you in this situation.
If you want more accurate results, you need more accurate data; there is no other way. You need to start with a higher-resolution image. (If you resample or interpolate, your results will actually be less accurate!)
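For completeness, resampling the 1600-point projection to 10000 points with scipy would look like the sketch below; again, this only smooths the curve between samples and adds no real information:

```python
import numpy as np
from scipy.interpolate import interp1d

def upsample(projection, n_out=10000):
    """Resample a 1-D projection to n_out points by cubic interpolation.

    The result passes through the original samples but carries no
    more information than the input.
    """
    x = np.arange(len(projection))
    x_new = np.linspace(0, len(projection) - 1, n_out)
    return interp1d(x, projection, kind="cubic")(x_new)
```

Band edges measured on the upsampled curve will land between original pixels, but their accuracy is still limited by the 1600-pixel sensor resolution.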
Update (as the question has changed)
@Hooked has made a nice point. Another way to think about it: instead of immediately averaging (which throws away the variance in the data), you can produce 40 graphs (like the lower one in your posted image), one from each horizontal row of your spectrum image. All these graphs will be pretty similar, but with some variation in peak position, height, and width. You should calculate the position, height, and width of each peak in each of these 40 rows, then combine this data (matching peaks across the 40 graphs) and, via the central limit theorem, use the appropriate variance as an error estimate for peak position, height, and width. That way you get the most out of your data. However, this assumes some independence between the rows of the spectrogram, which may or may not be the case.
I'd like to add some more detail to @fraxel's answer (too long for a comment). He's right that you can't get out any more information than you put in, but I think it needs some elaboration...
You are projecting your data from 1600x40 -> 1600, which looks like you are throwing data away. While technically correct, the whole point of a projection is to bring higher-dimensional data down to a lower dimension. This only makes sense if...
Your data can be adequately represented in the lower dimension. Correct me if I'm wrong, but it looks like your data is indeed one-dimensional: the vertical axis is a measure of the variability of that particular point on the x-axis (wavelength?).
Given that the projection makes sense, how can we best summarize the data at each wavelength point? In my previous answer, I took the average at each point. In the absence of other information about the particular properties of the system, this is a reasonable first-order approximation.
You can keep more of the information if you like. Below I've plotted the variance along the y-axis. It tells me that your measurements have more variability when the signal is high and less when it is low (which seems useful!):
What you need to decide, then, is what to do with those extra 40 pixels of data before the projection. They mean something physically, and your job as a researcher is to interpret and project that data in a meaningful way!
The code to produce the image is below; the spectrum data was taken from the screencap in your original post:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("spec2.png"))
# Average the pixel values along the colour axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)
# Compute the variance along the vertical axis
variance = pic_avg.var(axis=0)
scale = 1 / 40.0
X_val = range(projection.shape[0])
plt.errorbar(X_val, projection * scale, yerr=variance * scale)
plt.imshow(pic, origin='lower', alpha=0.8)
plt.axis('tight')
plt.show()
