How to work with highly skewed data - pandas DataFrame - Python

The dataset consists of 4000+ records. I am trying to identify anomalies in the 'duration' attribute. However, when the box plot is drawn, I can see that the data is highly skewed. I tried to transform the data, but the results were not good. The boxplot is attached below. How should I proceed in such cases?
Boxplot

What you could do is create a histogram of your data and try to fit a distribution to it. Suppose you were able to fit a standard normal distribution to your data; then you could identify anomalies by checking the probability of each sample under that distribution. If this probability is smaller than a threshold probability p, you could mark the sample as an anomaly.
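A minimal sketch of that idea, assuming the records live in a pandas DataFrame with a 'duration' column (the synthetic data and the threshold value are hypothetical stand-ins):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the 4000+ record dataset: a skewed
# 'duration' column drawn from a log-normal distribution.
rng = np.random.default_rng(0)
df = pd.DataFrame({"duration": rng.lognormal(mean=3.0, sigma=1.0, size=4000)})

# Fit a normal distribution as in the answer; for heavily skewed
# data a log-normal or similar distribution may fit better.
mu, sigma = stats.norm.fit(df["duration"])

# Density of each sample under the fitted distribution.
density = stats.norm.pdf(df["duration"], loc=mu, scale=sigma)

# Mark samples whose density falls below a chosen threshold p.
p = 1e-4  # threshold to tune for your data
anomalies = df[density < p]
print(f"flagged {len(anomalies)} of {len(df)} records")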

Related

Python & SciPy: How to estimate a von Mises scale factor with binned angle data?

I have an array of binned angle data and another array of weights for each bin. I am using the vmpar() function found here to estimate the loc and kappa parameters. I then use the vmpdf() function, found in the same script, to create a von Mises probability density function (pdf).
However, the vmpar function does not give me a scale parameter the way scipy's vonmises.fit() function does. And I don't know how to use vonmises.fit() with binned data, since that function does not seem to accept weights as input.
My question is therefore: how do I estimate the scale from my binned angle data? The reason I want to adjust the scale is so that I can plot my original data and the pdf on the same graph. Right now the pdf is not scaled to my original data, as seen in the image below (blue = original data, red line = pdf).
I am quite new to circular statistics, so perhaps there is a very easy way to implement this that I am overlooking. I need to figure this out ASAP, so I appreciate any help!
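One workaround (a sketch under stated assumptions, not the vmpar/vmpdf code from the linked script): approximate a weighted fit by repeating each bin centre according to its integer weight, fit with scipy.stats.vonmises while fixing scale = 1 (appropriate for circular data), and rescale the unit-area pdf to the histogram by multiplying by the total weight times the bin width:

import numpy as np
from scipy import stats

# Hypothetical binned angle data: bin centres over [-pi, pi) and counts.
n_bins = 36
edges = np.linspace(-np.pi, np.pi, n_bins + 1)
centres = 0.5 * (edges[:-1] + edges[1:])
weights = np.random.default_rng(1).integers(0, 50, size=n_bins)

# Approximate a weighted fit by repeating each centre by its count
# (this loses within-bin detail but lets vonmises.fit() be used).
sample = np.repeat(centres, weights)
kappa, loc, scale = stats.vonmises.fit(sample, fscale=1)  # fix scale=1 for circular data

# To overlay the pdf on the histogram, rescale the unit-area density
# by (total weight * bin width) so its area matches the counts.
bin_width = edges[1] - edges[0]
pdf_counts = stats.vonmises.pdf(centres, kappa, loc=loc) * weights.sum() * bin_width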

Quantify goodness of non-linear fit - Python

I have a dataset with data of differing quality. There are A-grade, B-grade, C-grade and D-grade data, with A-grade being the best and D-grade having the most scatter and uncertainty.
The data come from a quasi-periodic signal and, taking all of the data into consideration, they cover just one cycle. If we only take into account the best data (A- and B-graded, in green and yellow), we don't cover the whole cycle, but we are sure that we are only using the best data points.
After computing a periodogram to determine the period of the signal, both for the whole sample and for only the A- and B-graded data, I ended up with the following results: 5893 days and 4733 days respectively.
Using those values I fit the data to a sine and plot them in the following plot:
Plot with the data
In the attached figure the green points are the best ones and the red ones are the worst.
As you can see, the data only cover one cycle, and the red points (the worst ones) are crucial to cover that cycle, but they are not as good in quality. So I would like to know whether the curve fit is better with those points or not.
I was trying to use the R² parameter, but I have read that it only works properly for linear functions...
How can I quantify which of those fits is better?
Thank you
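One option (a sketch under stated assumptions, not the only possible metric): since R² is unreliable for non-linear models, compare the two fits with reduced chi-square or RMSE computed from the residuals of a scipy.optimize.curve_fit sine fit. All data and parameter values below are synthetic stand-ins:

import numpy as np
from scipy.optimize import curve_fit

def sine(t, amplitude, period, phase, offset):
    return amplitude * np.sin(2 * np.pi * t / period + phase) + offset

# Synthetic stand-in for the observations: times in days, one cycle.
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 6000, 80))
y = sine(t, 1.0, 5893, 0.3, 0.0) + rng.normal(0, 0.2, t.size)
sigma = np.full_like(y, 0.2)  # per-point uncertainties, if available

params, _ = curve_fit(sine, t, y, p0=[1.0, 5893, 0.0, 0.0], sigma=sigma)

# Reduced chi-square near 1 indicates a fit consistent with the
# uncertainties; compute it once per subsample and compare.
residuals = y - sine(t, *params)
dof = t.size - len(params)  # degrees of freedom
chi2_red = np.sum((residuals / sigma) ** 2) / dof
rmse = np.sqrt(np.mean(residuals ** 2))
print(f"reduced chi-square: {chi2_red:.2f}, RMSE: {rmse:.3f}")

Running this once on the full sample and once on the A/B-graded subset gives directly comparable numbers, which R² would not for a non-linear model.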

How can I remove outliers in a Python boxplot graph image?

When I was organizing my skewed distribution data into a boxplot in Python, it had a lot of outliers. I want to show only the maximum and minimum outliers.
How can I write the code for that?
I don't want to remove anything from my dataset; I just want to show two outliers (max and min) in my graph image.
Try showfliers=False:

plt.boxplot([data], showfliers=False)
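Note that showfliers=False hides every outlier. To bring back just the two extremes, a minimal sketch (assuming matplotlib's default 1.5 × IQR whisker rule; the data here are synthetic) is to suppress the fliers and overlay the min and max outliers manually:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.lognormal(mean=1.0, sigma=1.0, size=500)  # skewed sample

fig, ax = plt.subplots()
ax.boxplot([data], showfliers=False)  # suppress all outlier markers

# Recompute the outliers with the usual 1.5 * IQR rule.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
if outliers.size:
    # Re-plot only the smallest and largest outlier at the box's x-position.
    ax.plot([1, 1], [outliers.min(), outliers.max()], "r+", markersize=10)
plt.show()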

Explanation of sklearn OPTICS plot

I am currently learning how to use OPTICS in sklearn. I am inputting a numpy array of shape (205, 22). I am able to get plots out of it, but I do not understand how I am getting a 2D plot out of multiple dimensions, or how I am supposed to read it. I more or less understand the reachability plot, but the rest of it makes no sense to me. Can someone please explain what is happening? Is the function just simplifying the data to two dimensions somehow? Thank you
From the sklearn user guide:
The reachability distances generated by OPTICS allow for variable density extraction of clusters within a single data set. As shown in the above plot, combining reachability distances and data set ordering_ produces a reachability plot, where point density is represented on the Y-axis, and points are ordered such that nearby points are adjacent. ‘Cutting’ the reachability plot at a single value produces DBSCAN like results; all points above the ‘cut’ are classified as noise, and each time that there is a break when reading from left to right signifies a new cluster.
The other three plots are a visual representation of the actual clusters found by three different algorithms.
As you can see in the OPTICS clustering plot, there are two high-density clusters (blue and cyan); the gray crosses are, according to the reachability plot, classified as noise because of the low xi value.
In the DBSCAN clustering with eps = 0.5, everything is considered noise, since the epsilon value is too low and the algorithm cannot find any dense regions.
In the third plot the algorithm found just a single cluster because of the adjusted epsilon value, and everything above the 2.0 line is considered noise.
For more detail, please refer to the user guide.
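To make the "cut" idea concrete, here is a minimal sketch (the synthetic 2D data and parameter values are hypothetical) that runs OPTICS and draws the reachability plot with an example cut line; the same reachability plot works for 22-dimensional input, since only the ordering and the distances are plotted:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

# Two toy clusters of different density.
rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),  # dense cluster
    rng.normal(loc=(3, 3), scale=0.6, size=(100, 2)),  # looser cluster
])

optics = OPTICS(min_samples=10, xi=0.05)
optics.fit(X)

# Reachability distances in cluster order: valleys are clusters,
# peaks separate them; points above a chosen cut count as noise.
reachability = optics.reachability_[optics.ordering_]
plt.plot(reachability)
plt.axhline(2.0, color="k", linestyle="--", label="example cut at 2.0")
plt.ylabel("reachability distance")
plt.legend()
plt.show()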

How to find the gradient of grid-less data in Python?

I have a pandas data frame containing location data (x_m and y_m) and another variable represented by the color bar in the figure.
Sample figure showing the data points and a possible gradient arrow
How can I obtain the average gradient of all of the data points in my data set? I drew one of the possible solutions showing the gradient vector.
Thank you!
EDIT:
I ended up using scipy.interpolate.griddata, similar to what was done here: https://earthscience.stackexchange.com/questions/12057/how-to-interpolate-scattered-data-to-a-regular-grid-in-python
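For reference, a sketch of that griddata route (the column names follow the question; the synthetic linear field is hypothetical and only there to make the expected answer checkable): interpolate the scattered values onto a regular grid, take np.gradient, and average.

import numpy as np
import pandas as pd
from scipy.interpolate import griddata

# Hypothetical scattered data with a known linear field for checking.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "x_m": rng.uniform(0, 10, 300),
    "y_m": rng.uniform(0, 10, 300),
})
df["value"] = 2.0 * df["x_m"] - 0.5 * df["y_m"]

# Regular grid covering the data.
xi = np.linspace(df["x_m"].min(), df["x_m"].max(), 100)
yi = np.linspace(df["y_m"].min(), df["y_m"].max(), 100)
Xi, Yi = np.meshgrid(xi, yi)

# Interpolate the scattered points onto the grid (NaN outside the hull).
Zi = griddata((df["x_m"], df["y_m"]), df["value"], (Xi, Yi), method="linear")

# np.gradient returns derivatives along axis 0 (y) then axis 1 (x).
dz_dy, dz_dx = np.gradient(Zi, yi, xi)
avg_gradient = np.array([np.nanmean(dz_dx), np.nanmean(dz_dy)])
print(avg_gradient)  # approximately [2.0, -0.5] for this synthetic field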
