The pandas plot.kde() method (Series.plot.kde() or DataFrame.plot.kde()) is handy for plotting the estimated density function of a continuous random variable. It takes data x as input and displays the estimated density p(x) as its output.
How can I extract the density values it computes? Instead of just plotting the smoothed estimate, I would like an array or pandas Series that contains the values it computed internally.
If this can't be done with the pandas kde, let me know of any equivalent in scipy or another library.
There are several ways to do that. You can either compute it yourself or get it from the plot.
As pointed out in the comment by @RichieV following this post, you can extract the data from the plot using
data.plot.kde().get_lines()[0].get_xydata()
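A minimal sketch of that extraction, assuming a synthetic Series in place of your data and a non-interactive matplotlib backend (both are assumptions for illustration; the name kde_series is hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is opened
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption for illustration)
rng = np.random.default_rng(0)
data = pd.Series(rng.normal(loc=40, scale=10, size=500))

ax = data.plot.kde()                 # returns the matplotlib Axes
xy = ax.get_lines()[0].get_xydata()  # array of shape (n_points, 2)

# Wrap the curve in a Series: index = x grid, values = estimated density
kde_series = pd.Series(xy[:, 1], index=xy[:, 0], name="density")
```

The first column of xy is the grid pandas chose for the x axis and the second column is the density evaluated on that grid.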
Use seaborn and then the same as in 1):
You can use seaborn to estimate the kernel density and then matplotlib to extract the values (as in this post). You can either use distplot or kdeplot:
import seaborn as sns
# kde plot
x,y = sns.kdeplot(data).get_lines()[0].get_data()
# distplot (deprecated since seaborn 0.11)
x,y = sns.distplot(data, hist=False).get_lines()[0].get_data()
You can use the underlying methods of scipy.stats.gaussian_kde to estimate the kernel density which is used by pandas:
import scipy.stats
density = scipy.stats.gaussian_kde(data)
and then you can use this to evaluate it on a set of points:
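import numpy as np
xs = np.linspace(0, 80, 200)  # evaluation grid; pick a range that covers your data
ys = density(xs)  # estimated density at each grid point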
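Putting the scipy route together, here is a self-contained sketch. The synthetic data, the padding of three standard deviations, and the grid size of 400 are all assumptions chosen for illustration, not values from the question:

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for the real data (assumption for illustration)
rng = np.random.default_rng(0)
data = rng.normal(loc=40, scale=10, size=1000)

density = scipy.stats.gaussian_kde(data)

# Build the grid from the data instead of hard-coding its limits
pad = 3 * data.std()
xs = np.linspace(data.min() - pad, data.max() + pad, 400)
ys = density(xs)  # density values, one per grid point

# Sanity check: the trapezoid-rule area under the curve should be close to 1
area = np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs))
```

Note that ys holds density values rather than probabilities; a probability is obtained by integrating the density over an interval. If you want a pandas Series, pd.Series(ys, index=xs) should do.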
Related
I visualize a density function (PDF) using two plotting approaches: displot() and plot(). I don't understand why displot() doesn't produce a normally distributed plot whereas plot() does this perfectly. The density plots should look alike, but they don't. What's wrong with displot() here?
from scipy.stats import norm
import seaborn as sns
import numpy as np
data_x= np.arange(-4, 4, 0.001)
norm_pdf = norm.pdf(data_x)
sns.displot(data = norm_pdf, x = data_x, kind='kde')
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
data_x= np.arange(-4, 4, 0.001)
plt.plot(data_x, norm.pdf(data_x))
plt.show()
displot (or the underlying kdeplot) creates an approximation of a probability density function (pdf) to resemble the function that might have generated the given random data. As input, you'll need random data. The function will mimic these data as a sum of Gaussian bell shapes (a "kernel density estimation" with a Gaussian kernel).
Here is an example using 8000 random points as input. You'll notice the curve resembles the normal pdf, but is also a bit "bumpier" (that's what randomness looks like).
data_x = norm.rvs(size=8000)
sns.kdeplot(x=data_x)
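To make the "sum of Gaussian bell shapes" concrete, here is a rough sketch that builds the estimate by hand and compares it with scipy.stats.gaussian_kde. It assumes the one-dimensional case with scipy's default bandwidth; the sample size of 500 and the names h, bells and manual are illustrative choices:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
data_x = rng.normal(size=500)  # random samples, as a KDE expects

kde = gaussian_kde(data_x)
grid = np.linspace(-4, 4, 200)

# In 1-D, the bandwidth is the kde's factor times the sample standard deviation
h = kde.factor * data_x.std(ddof=1)

# One Gaussian bell per data point, then the average over all bells
bells = norm.pdf(grid[:, None], loc=data_x[None, :], scale=h)
manual = bells.mean(axis=1)

# Largest difference between the hand-built curve and scipy's result
max_diff = np.abs(manual - kde(grid)).max()
```

If the two curves agree, max_diff is at the level of floating-point noise, which is a quick way to convince yourself what the plotted line actually is.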
When you call kdeplot (or displot(..., kind='kde')) with both data= and x=, and x= isn't a column name in a dataframe, data= gets ignored. So you are estimating the density of 8000 evenly spaced values between -4 and 4. The kde of such data looks like a flat line between -4 and 4. But because the kde assumes the underlying function locally resembles a Gaussian, the start and end are smoothed out.
data_x = np.arange(-4, 4, 0.001)
sns.kdeplot(x=data_x)
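The flat shape can be checked numerically without plotting. This sketch uses scipy.stats.gaussian_kde directly as a stand-in for what the plot computes (an assumption; seaborn's defaults may differ slightly), and the evaluation points are arbitrary:

```python
import numpy as np
from scipy.stats import gaussian_kde

data_x = np.arange(-4, 4, 0.001)  # evenly spaced values, not random samples
kde = gaussian_kde(data_x)

# Well inside the interval, the estimate is flat at about 1 / (4 - (-4)) = 0.125
middle = kde(np.linspace(-2, 2, 9))

# At the boundaries, roughly half of each bell falls outside the data,
# so the estimate drops to about half the plateau value
edges = kde(np.array([-4.0, 4.0]))
```

The values in middle stay near 0.125, while the values in edges sit near 0.0625, which is the smoothed start and end described above.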
Here's the least-squares quadratic fitting result of my data: y = 0.06(+/- 0.16)x**2-0.65(+/-0.04)x+1.2(+/-0.001). Is there a direct way to plot the fit as well as the error band? I found a similar example which used the plt.fill_between method. However, in that example the boundaries are known, while in my case I'm not sure which parameter values correspond to the boundaries. I don't know whether I could use plt.fill_between or need a different approach. Thanks!
You can use seaborn.regplot to calculate the fit and plot it directly (order=2 is second order fit):
Here is a dummy example:
import seaborn as sns
import numpy as np
xs = np.linspace(0, 10, 50)
ys = xs**2+xs+1+np.random.normal(scale=20, size=50)
sns.regplot(x=xs, y=ys, order=2)
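If you would rather compute the band yourself and pass it to plt.fill_between, one possible approach (a different technique from regplot, whose band comes from bootstrapping) is to propagate the coefficient covariance returned by np.polyfit. This is only a sketch on the same dummy data; the one-sigma width and the names grid, sigma, lower and upper are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 50)
ys = xs**2 + xs + 1 + rng.normal(scale=20, size=50)

# Quadratic least-squares fit; cov is the 3x3 covariance of the coefficients
coeffs, cov = np.polyfit(xs, ys, deg=2, cov=True)

grid = np.linspace(xs.min(), xs.max(), 200)
fit = np.polyval(coeffs, grid)

# Derivative of the fitted curve w.r.t. each coefficient: columns x**2, x, 1
jac = np.vander(grid, 3)

# Standard error of the fitted curve at each grid point: sqrt(J C J^T)
sigma = np.sqrt(np.einsum("ij,jk,ik->i", jac, cov, jac))
lower, upper = fit - sigma, fit + sigma
```

With these arrays, plt.plot(grid, fit) followed by plt.fill_between(grid, lower, upper, alpha=0.3) draws the curve and a one-sigma band. Using the full covariance matrix matters here, because the per-coefficient uncertainties alone ignore the correlations between the coefficients.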
When plotting the estimated density function of my data using sns.kdeplot(), the algorithm extrapolates outside of the boundaries of the data, meaning that it draws the plot for values smaller than 0 or greater than 1, which is particularly annoying when dealing with probabilities. Example:
import numpy as np
import seaborn as sns
data = np.random.random(100)
sns.kdeplot(data)
How to fix that so that the actual plot remains within [0,1] in the x-direction without simply calling plt.xlim((0,1))?
I have some data in a pandas.Series, with a length of 10,000,000.
Plotting the histogram directly combines similar values together,
making them indistinguishable.
What is a proper way of visualizing the values?
Matplotlib, pandas and seaborn all provide a histogram function. Each of these functions has an optional argument specifying the number of bins, i.e. the resolution of the histogram. Change this value until you get the resolution you like.
import pandas as pd
import numpy as np
data = np.random.randn(1000)
series = pd.Series(data)
# Change the number of bins to affect the histogram resolution
series.hist(bins=100)
I am trying to plot data in a histogram or bar chart in Python. The data size (array size) is between 0 and 10000. The data itself (each entry of the array) depends on the input and has a range between 0 and e+20 (mostly the data is in the same range). So I want to do a hist plot with matplotlib. I want to plot how often a value falls in some interval (to illustrate the mean and deviation). Sometimes it works like this:
hist1.
But sometimes there is a problem with the interval size like this:
hist2.
In this plot I need more bars around 0-100, etc.
Can anyone help me with this?
The plots are just made with:
from numpy.linalg import *
import matplotlib.pyplot as plt
plt.hist(numbers,bins=100)
plt.show()
By default, hist produces a plot with an x range that covers the full range of your data.
If you have one outlier at a very high x compared with the other values, then you will see this image with a 'compressed' figure.
If you want to always have the same view, you can fix the limits with xlim.
Alternatively, if you want to see your distribution always centered and displayed as nicely as possible, you can calculate the mean and the standard deviation of your data and fix the x range accordingly (e.g. mean +/- 5 stdev).
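A small sketch of the mean +/- 5 stdev idea, using made-up data with a single outlier (the data, the outlier value and the variable names are assumptions for illustration). It uses np.histogram so the binning can be inspected without a plot:

```python
import numpy as np

# Made-up data: 1000 values around 50 plus one outlier (assumption)
rng = np.random.default_rng(0)
numbers = np.append(rng.normal(loc=50, scale=10, size=1000), 1000.0)

# Restrict the histogram range to mean +/- 5 standard deviations
mean, std = numbers.mean(), numbers.std()
lo, hi = mean - 5 * std, mean + 5 * std

# All 100 bins are now spent on [lo, hi]; the outlier is left out
counts, bin_edges = np.histogram(numbers, bins=100, range=(lo, hi))
```

plt.hist(numbers, bins=100, range=(lo, hi)) takes the same range argument. Unlike xlim, which only changes the visible window, range changes where the bins are placed, so you actually get more bars in the region of interest. Keep in mind that the mean and standard deviation are themselves pulled by outliers; with very extreme values, percentiles may be a more robust way to pick the limits.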