Dealing with empty regions of data in a histogram - python

I have two arrays of shape (1, N) containing right ascension and declination. In a scatter plot of this data, the top-left region is empty: no data was collected there.
I would like to form a histogram of this data as a way of investigating clustering. That empty region (and many others like it), it seems to me, will cause a problem: when numpy.histogram2d draws a grid over the data and counts the points in each cell, cells that fall on the empty region will get a count of zero. This pulls down the mean of the whole histogram. Having a sensible mean is important because I identify clusters via their standard deviation from the mean.
The problem is, of course, not that those cells are empty, but that there is no data to count; ideally those cells should be ignored.
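One way to express "ignore those cells" is a NumPy masked array: mask the zero-count cells before taking statistics. Below is a minimal sketch with synthetic RA/Dec stand-ins; note the assumption that a zero cell always means "unobserved", which also hides genuinely observed-but-empty cells:

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the question's RA/Dec arrays (flattened from shape (1, N)).
ra = rng.uniform(0, 60, 10_000)
dec = rng.uniform(-30, 0, 10_000)
# Carve out an unobserved corner to mimic the empty top-left region.
keep = ~((ra < 15) & (dec > -7.5))
ra, dec = ra[keep], dec[keep]

counts, xedges, yedges = np.histogram2d(ra, dec, bins=50)

# Mask zero-count cells so they are excluded from the statistics.
masked = np.ma.masked_equal(counts, 0)
mean, std = masked.mean(), masked.std()
clusters = masked > mean + 3 * std  # e.g. cells more than 3 sigma above mean
print(mean, std, int(clusters.sum()))

If the survey footprint is known independently, a better mask is the footprint itself rather than counts == 0.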

Related

how to find the exact data value on click event or mouse hover in a time series graph drawn by datashader and holoview

Here I am trying to visualize 1 billion data points. The scatter plot below represents a graph of time/value pairs, for example:
df:
TIME,VAL
145000000, 1.464000
150000000, 1.466000
155000000, 1.461250
160000000, 1.481750
165000000, 1.493500
170000000, 1.514500
175000000, 1.524000
180000000, 1.543750
185000000, 1.553750
190000000, 1.582000
195000000, 1.594000
200000000, 1.625000
205000000, 1.639500
210000000, 1.679250
215000000, 1.697250
220000000, 1.720000
I need to find the exact time/value pair that is mapped to a given (x, y) point.
Is there any way to recover the real time/value behind a particular (x, y) click on the raster image being rendered on screen?
If you're looking to find the individual row contributing to a specific point, the answer is that you can't.
Unlike matplotlib, datashader is not rendering individual points. Instead, it first defines the image boundaries, then using the requested number of pixels, computes the range of values in (x, y) which fall into each pixel. It then bins/discretizes your data, so the rendering engine is only working with summary statistics for each pixel - not the individual values from your source data. This is what makes datashader so powerful when rendering huge datasets, but it also means that nowhere is there a mapping from rows to pixels.
You could of course identify the boundaries of a given pixel and then filter your dataset to pull all rows with data falling into these bounds. But there's no guarantee that the match will be unique (this depends on your data).
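A minimal sketch of that manual lookup, assuming you know the image's data ranges and pixel dimensions; the names x_range, y_range, plot_width, plot_height here are hypothetical parameters, not a datashader API:

import pandas as pd

def rows_in_pixel(df, px, py, x_range, y_range, plot_width, plot_height):
    """Return the rows of df whose (TIME, VAL) fall into pixel (px, py).

    px, py are pixel indices with (0, 0) at the lower-left corner, matching
    datashader's default orientation; adjust if your backend differs.
    """
    x0, x1 = x_range
    y0, y1 = y_range
    dx = (x1 - x0) / plot_width
    dy = (y1 - y0) / plot_height
    xlo, xhi = x0 + px * dx, x0 + (px + 1) * dx
    ylo, yhi = y0 + py * dy, y0 + (py + 1) * dy
    return df[(df.TIME >= xlo) & (df.TIME < xhi)
              & (df.VAL >= ylo) & (df.VAL < yhi)]

# Example with the first rows of the sample data from the question:
df = pd.DataFrame({"TIME": [145000000, 150000000, 155000000],
                   "VAL": [1.464000, 1.466000, 1.461250]})
print(rows_in_pixel(df, px=0, py=0,
                    x_range=(145000000, 160000000),
                    y_range=(1.46, 1.49),
                    plot_width=3, plot_height=3))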
Michael Delgado is correct that the rendered image doesn't contain information about the rows, but (a) you can use the HoloViews "inspect" tools to look up the original datapoints mapped into that pixel (automating the process he describes; see https://examples.pyviz.org/ship_traffic), and (b) it's on our list to provide such an inverse mapping from pixel to rows, with the constraint that only n datapoints can be returned for each pixel (see proposal in https://github.com/holoviz/datashader/issues/1126). Once we have such a mapping, it should be trivial to provide hover and click information in holoviews for datashader plots without the cost of searching through the original dataset. Wish us luck, and in the meantime, use inspect_points!
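For reference, a rough sketch of the inspect_points pattern, modeled on the linked ship_traffic example; the exact call signature here is an assumption, so check that example and the holoviews.operation.datashader documentation:

import holoviews as hv
import numpy as np
import pandas as pd
from holoviews.operation.datashader import rasterize, inspect_points

hv.extension("bokeh")

# Synthetic stand-in for the time/value dataframe.
df = pd.DataFrame({"TIME": np.arange(1_000_000),
                   "VAL": np.cumsum(np.random.randn(1_000_000))})

raster = rasterize(hv.Points(df, kdims=["TIME", "VAL"]))

# inspect_points builds a dynamic element that looks up the source rows
# mapped into the pixel under the cursor, so hover/click can show them.
highlight = inspect_points(raster).opts(tools=["hover"])
overlay = raster * highlight  # display in a notebook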

Splitting a graph

I have an array of 575 points.
When I plot it, I get the curve shown in the attached image.
I want to split it into subgraphs wherever the slope becomes 0, i.e. where the graph becomes parallel to the x-axis.
Thanks in advance.
I understand you want some degree of smoothness; otherwise you will end up with many small separated regions of the graph.
You also need to define precisely what you consider "parallel to the x-axis".
I suggest starting by moving a running window of a certain length along the data and classifying each window as horizontal when some condition holds.
The condition can be something like "all values are inside a certain range", and it may take into account characteristics like the variance and the mean of the points inside the window; for example, "all values are between 99% and 101% of the window mean." A sketch of that idea follows.
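The sketch below implements that running-window test with NumPy; window and tol are hypothetical tuning knobs, and the mean-relative condition misbehaves near zero, so adapt it to your data:

import numpy as np

def split_on_flats(y, window=10, tol=0.01):
    """Split y into segments separated by 'flat' (horizontal) runs.

    A window counts as flat when all of its values lie within tol
    (e.g. 1%) of the window mean.
    """
    y = np.asarray(y, dtype=float)
    flat = np.zeros(len(y), dtype=bool)
    for start in range(len(y) - window + 1):
        seg = y[start:start + window]
        m = seg.mean()
        if m != 0 and np.all(np.abs(seg - m) <= tol * abs(m)):
            flat[start:start + window] = True
    # The remaining non-flat stretches become the subgraphs.
    idx = np.flatnonzero(~flat)
    if idx.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(idx) > 1) + 1
    return [y[chunk[0]:chunk[-1] + 1] for chunk in np.split(idx, breaks)]

# Example: a ramp, a plateau, then another ramp -> two segments.
y = np.concatenate([np.linspace(0, 5, 50), np.full(30, 5.0),
                    np.linspace(5, 0, 50)])
print([len(s) for s in split_on_flats(y, window=10, tol=0.001)])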

Laplace interpolation between known values in a matrix

I'm working on a heatmap generation program which will hopefully fill in the colors based on value samples taken across a building layout (this is not GPS based).
If I have only a few known data points like these in a large matrix of unknowns, how do I interpolate the values in between in Python?:
0,0,0,0,1,0,0,0,0,0,5,0,0,0,0,9
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,2,0,0,0,0,0,0,0,0,8,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,8,0,0,0,0,0,0,0,6,0,0,0,0,0,0
0,0,0,0,0,3,0,0,0,0,0,0,0,0,7,0
I understand that bilinear interpolation won't do it, and a Gaussian filter will bring all the peaks down to low values due to the sheer number of surrounding zeros. This is obviously a matrix-handling proposition, and I don't need it to be Bezier-curve smooth; just close enough to serve as a graphical representation would be fine. My matrix will end up being about 1500×900 cells in size, with approximately 100 known points.
Once the values are interpolated, I have written code to convert it all to colors, no problem. It's just that right now I'm getting single colored pixels sprinkled over a black background.
Proposing a naive solution:
Step 1: interpolate and extrapolate the existing data points onto their surroundings.
This can be done with a "wave propagation" type of algorithm: the known points "spread out" their values onto the surroundings until the whole grid is "flooded" with known values. At the end of this stage you have a number of intersecting "disks", and no zeroes left.
Step 2: smooth the result (using bilinear filtering or some other filter). A sketch of both steps is below.
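One compact way to get the "flood" is a nearest-neighbour feature transform; here is a sketch using scipy.ndimage, treating 0 as "unknown" (which assumes 0 is never a real sample):

import numpy as np
from scipy import ndimage

# The small example matrix from the question.
grid = np.zeros((6, 16))
for r, c, v in [(0, 4, 1), (0, 10, 5), (0, 15, 9), (2, 3, 2), (2, 12, 8),
                (4, 1, 8), (4, 9, 6), (5, 5, 3), (5, 14, 7)]:
    grid[r, c] = v

known = grid != 0
# Step 1: for each unknown cell, find the indices of the nearest known cell
# and copy its value ("flooding" the grid with intersecting disks).
_, (ri, ci) = ndimage.distance_transform_edt(~known, return_indices=True)
flooded = grid[ri, ci]

# Step 2: smooth, then optionally pin the original samples back in place.
smoothed = ndimage.gaussian_filter(flooded, sigma=2)
smoothed[known] = grid[known]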
If you are able to use SciPy, then interp2d does exactly what you want. A possible problem with it is that it seems not to extrapolate smoothly, according to this issue. This means that all values near the walls are going to be the same as their closest neighbour points. This can be solved by putting thermometers in all 4 corners :)
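interp2d has since been deprecated in SciPy, so as an alternative (my suggestion, not the answerer's code), scipy.interpolate.griddata handles the same scattered-sample case:

import numpy as np
from scipy.interpolate import griddata

# Scattered known samples: (row, col) positions and their values,
# taken from the example matrix above.
points = np.array([(0, 4), (0, 10), (0, 15), (2, 3), (2, 12),
                   (4, 1), (4, 9), (5, 5), (5, 14)])
values = np.array([1, 5, 9, 2, 8, 8, 6, 3, 7], dtype=float)

rows, cols = np.mgrid[0:6, 0:16]
# 'cubic' gives a smooth surface inside the convex hull of the samples;
# cells outside the hull come back as NaN, so fall back to 'nearest' there.
smooth = griddata(points, values, (rows, cols), method="cubic")
nearest = griddata(points, values, (rows, cols), method="nearest")
filled = np.where(np.isnan(smooth), nearest, smooth)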

How to detect and eliminate outliers from a changing dataset

I have a dataset which contains the pixel location of a certain object in each frame. My code detects the object accurately most of the time, but there are occasional misdetections.
I plotted the first 600 values (x-axis: frame number, y-axis: pixel location of the object). The first image shows the raw data; the second shows the correct path.
I have already tried mean and median filtering with different parameters, but couldn't get anything useful. Is there any way/algorithm to replace the outliers with correct values?
RANSAC is a technique for ignoring outliers and selecting only the inliers for a computation.
Since this case does not have a single mathematical function to fit to the whole dataset, you cannot apply RANSAC directly.
But as a workaround, looking at the data graph, you can try to fit a line to every 20 frames of data and remove the outliers in each interval. This should help reduce the effect of the outliers; a sketch is below.
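Here is a sketch of that piecewise idea with scikit-learn's RANSACRegressor on synthetic data; the 20-frame chunk size comes from the answer above, while the noise levels and seeds are made up for illustration:

import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic track: a drifting pixel location with simulated misdetections.
rng = np.random.default_rng(0)
frames = np.arange(600)
y = 0.5 * frames + rng.normal(0, 2, 600)
y[rng.choice(600, 30, replace=False)] += 200  # outlier spikes

cleaned = y.copy()
for start in range(0, len(y), 20):
    sl = slice(start, start + 20)
    X = frames[sl].reshape(-1, 1)
    model = RANSACRegressor(random_state=0).fit(X, y[sl])
    # Replace the points RANSAC flags as outliers with the fitted line.
    outliers = ~model.inlier_mask_
    cleaned[sl][outliers] = model.predict(X)[outliers]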

Calculating and Plotting 2nd moment of image

I am trying to plot the 2nd moments onto an image file (the image file is a numpy array of the brightness distribution). I have a rough understanding that the 2nd moment is something like the moment of inertia tensor (Ixx, Iyy), but I am not sure how to calculate it, or how it translates into two intersecting lines with the centroid at their intersection. I tried using scipy.stats.mstats.moment, but I am unsure what to pass as axis if I just want the two 2nd moments that intersect at the centroid.
It also returns an array, and I am not exactly sure what the values in that array signify or how they relate to what I am going to plot (the scatter method in the plotting module needs at least two corresponding values per point).
Thank you.
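No answer is recorded here, but for orientation, here is a sketch of the standard intensity-weighted moment computation the question seems to be after (a common approach, not a posted solution): the brightness-weighted centroid plus the eigenvectors of the central second-moment matrix give the two principal axes to draw through the centroid.

import numpy as np

# Sketch: intensity-weighted centroid and central 2nd moments of an image.
rng = np.random.default_rng(1)
img = rng.random((100, 120))  # stand-in for the brightness array

ys, xs = np.indices(img.shape)
total = img.sum()
cx, cy = (xs * img).sum() / total, (ys * img).sum() / total  # centroid

# Central second moments, analogous to the inertia tensor components.
ixx = ((xs - cx) ** 2 * img).sum() / total
iyy = ((ys - cy) ** 2 * img).sum() / total
ixy = ((xs - cx) * (ys - cy) * img).sum() / total

# Eigenvectors of the moment matrix are the principal axes: two lines
# crossing at the centroid, which is what you would overlay on the image.
evals, evecs = np.linalg.eigh(np.array([[ixx, ixy], [ixy, iyy]]))
print((cx, cy), evals)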
