Kernel Density Estimation in Python

I have a list of counts ('y' in the code below) that I am using to plot a probability distribution. Note that this is not raw data: these are frequencies I have already calculated, one per bin. A scatter plot, and also a histogram drawn with the bar function, showed that it is some manner of bimodal distribution. I wanted to fit a pdf to it, so I first tried a sum of Gaussians, but the curve-fitting algorithm in SciPy was unable to fit the curve. I then came across kernel density estimation, which from what I have read is the best way to achieve this. Both an answer to a similar question here on Stack Overflow and a different website recommended the gaussian_kde function from scipy.stats, but even after putting together code from the two I have not been able to get it to work. Am I wrong in assuming that I can do this with the binned frequencies I have? If it is possible, what can I do to get it right?
x = np.linspace(x_min, x_max, n_bins)
y = np.array(normed_pdf)
plt.scatter(x,y,s=5, label='Sim Data')
plt.hold('on')
kde = gaussian_kde(y, bw_method=0.1 / y.std(ddof=1))
kde.covariance_factor = lambda : .25
kde._compute_covariance()
plt.plot(x, kde(x), 'r-', label='fit')
plt.hold('off')
plt.grid(True)
plt.legend(prop={'size':10})
plt.show()
I know that I could just as well use R, GNUPlot or some other tool for this, but I want to be able to do it within Python. Call me a stickler for self-contained, consistent code.
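One possible way to get a KDE from binned frequencies, sketched here under assumptions: gaussian_kde expects sample positions as its first argument, so passing the frequencies 'y' as the dataset estimates the density of the frequency values themselves rather than of x. SciPy 1.2 and later accept a weights argument, so the bin positions can be passed as the dataset and the frequencies as weights. The data below is a made-up bimodal stand-in for x and normed_pdf, and the bandwidth factor 0.25 is only borrowed from the question's covariance_factor.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical binned data: bin positions and bimodal frequencies
# (stand-ins for x and normed_pdf in the question).
x = np.linspace(-5.0, 5.0, 50)
y = np.exp(-0.5 * (x + 2.0) ** 2) + 0.6 * np.exp(-0.5 * ((x - 2.0) / 0.7) ** 2)

# Fit on the bin positions and weight each position by its frequency.
# A scalar bw_method is used directly as the bandwidth factor.
kde = gaussian_kde(x, weights=y, bw_method=0.25)

# Evaluate the smoothed density on a finer grid that extends past the bins.
grid = np.linspace(x.min() - 3.0, x.max() + 3.0, 1000)
density = kde(grid)

# The result could then be drawn with plt.plot(grid, density, 'r-').
```

The weights are normalised internally, so the frequencies do not need to sum to one. The estimate is only as fine as the bins: all the mass of a bin is treated as sitting at a single position.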

Related

evaluate numerically the density that `sns.kdeplot` has put in the plot?

Out of the box, seaborn does a very good job of plotting a 2D KDE or jointplot. However, it does not return anything like a function that I can evaluate to read the values of the estimated density numerically.
How can I evaluate numerically the density that sns.kdeplot or jointplot has put in the plot?
Just for completeness: I see something interesting in the SciPy docs, stats.gaussian_kde, but I am getting very clunky density plots, which for some reason, apparently a missing extent, are really off compared to the scatter plot. So I would like to stay away from the SciPy KDE, at least until I figure out how to make it work and why pyplot is so much less "smart" than seaborn.
Anyhow, the evaluate method of scipy.stats.gaussian_kde does its job.
I also faced this issue with the jointplot() method. I opened the file distributions.py under anaconda3/lib/python3.7/site-packages/seaborn/ and added these lines to the _bivariate_kdeplot() function:
print("xx=",xx[50])
print("yy=",yy[:,50])
print("z=",z[50])
This prints 100 values from each of the xx, yy and z arrays at index 50. Here "z" is the density, and "xx" and "yy" are the coordinates, adjusted according to the bandwidth, cut and clip, laid out as a meshgrid at the grid size given by the user. This gave me some idea of the actual values behind the 2D KDE plot.
If you print the entire array for each variable, you get 100 x 100 values for each.
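As an alternative to editing the seaborn source, the density can be evaluated directly with scipy.stats.gaussian_kde, as mentioned in the question. The following is a minimal sketch with made-up sample data; the result will not necessarily match seaborn's plot exactly, since bandwidth and grid settings may differ. The extent tuple is what keeps an image of the grid aligned with a scatter plot of the same data.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical 2D sample standing in for the plotted data.
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 500)
ys = 0.5 * xs + rng.normal(0.0, 0.5, 500)

# gaussian_kde expects an array of shape (n_dimensions, n_points).
kde = gaussian_kde(np.vstack([xs, ys]))

# Density at a single point (x=0, y=0).
point_density = kde([[0.0], [0.0]])

# Density on a 100 x 100 grid spanning the data range.
xx, yy = np.meshgrid(
    np.linspace(xs.min(), xs.max(), 100),
    np.linspace(ys.min(), ys.max(), 100),
)
zz = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# Passing the data range as the extent aligns the image with a scatter plot:
# plt.imshow(zz, origin="lower", extent=extent, aspect="auto")
extent = (xs.min(), xs.max(), ys.min(), ys.max())
```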

What is the best way to create a figure with distributions as insets in python?

I created a graph in MATLAB (see figure below) such that around every data point there is a data distribution plotted (grey area plots). The way I did it in MATLAB was to create a set of axes for every distribution curve and then plot the curves without showing those axes at every point of the data curve. I also used a command 'linkaxes' to set figure limits for all the curves at once.
I must say that this is far from an elegant solution, and I had a lot of trouble saving this figure with the correct aspect ratio. All in all, I couldn't find any other useful option in MATLAB.
Is there a more elegant solution for this type of graph in Python? I am not so much interested in how to draw the highlighted areas as in how to place a set of curves (distributions) exactly at the positions of the points of the main data curve.
Thank you!
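One approach that avoids separate axes altogether, sketched here with made-up data: draw each distribution on the main axes, shifted horizontally to the x position of its data point. Every curve then shares the limits of the main plot, so nothing needs to be linked. The samples, the width of 0.4 and the use of gaussian_kde for the curves are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical data: one sample of measurements per x position.
rng = np.random.default_rng(1)
x_main = np.arange(5, dtype=float)
samples = [rng.normal(loc=m, scale=0.3, size=200) for m in (1.0, 1.5, 1.2, 2.0, 1.8)]
y_main = np.array([s.mean() for s in samples])

fig, ax = plt.subplots()
y_grid = np.linspace(0.0, 3.0, 200)
width = 0.4  # largest horizontal extent of a distribution, in x data units

polys = []
for xi, s in zip(x_main, samples):
    density = gaussian_kde(s)(y_grid)
    # Rescale so the peak of each curve reaches xi + width.
    scaled = width * density / density.max()
    # Fill horizontally between the data point's x position and the curve.
    polys.append(ax.fill_betweenx(y_grid, xi, xi + scaled, color="0.8"))

# Main data curve drawn on top of the grey distributions.
ax.plot(x_main, y_main, "ko-")
```

Because each curve is rescaled to its own peak, the grey areas show shape but are not comparable in absolute height; drop the division by density.max() if that matters.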

Linear Regression from a .csv file in matplotlib

Can someone explain how to make a scatter plot and linear regression from an excel file?
I know how to import the file with pandas, and I know how to make a scatter plot by plugging my own data into matplotlib, but I don't know how to make Python do all three from the file.
Ideally it would also give r value, p value, std error, slope and intercept.
I'm very new to all of this and any help would be great.
I've searched around Stack Overflow, Reddit and elsewhere, but I haven't found anything recent.
SciPy has a basic linear regression function that fits your criteria: scipy.stats.linregress. Just use the appropriate columns from your DataFrame as x and y.
Pyplot's basic plt.plot(x, y) function (matplotlib.pyplot.plot) will give you a line. You can compute the set of y values for it from the slope and intercept.
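Put together, the steps above might look like the sketch below. The column names x and y and the inline CSV text are made up so that the example is self-contained; with a real file you would call pd.read_csv with its path instead (or pd.read_excel for an Excel workbook).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import linregress

# Hypothetical file contents; replace io.StringIO(csv_text) with a file path.
csv_text = "x,y\n1,2.1\n2,3.9\n3,6.2\n4,7.8\n5,10.1\n"
df = pd.read_csv(io.StringIO(csv_text))

# The result carries slope, intercept, rvalue, pvalue and stderr.
result = linregress(df["x"], df["y"])

# y values of the fitted line at the observed x positions.
fit_y = result.intercept + result.slope * df["x"]

fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"], label="data")
ax.plot(df["x"], fit_y, "r-", label="fit")
ax.legend()

print(result.slope, result.intercept, result.rvalue, result.pvalue, result.stderr)
```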

How can I work around overflow error in matplotlib?

I'm solving a set of coupled differential equations with odeint from scipy.integrate.
For the integration time I have:
t=numpy.linspace(0,8e+9,5e+06)
where 5e+06 is the timestep.
I then plot the equations I have as such:
plt.xscale('symlog') #x axis logarithmic scale
plt.yscale('log',basey=2) #Y axis logarithmic scale
plt.gca().set_ylim(8, 100000) #Changing y axis ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(matplotlib.ticker.ScalarFormatter())
ax.xaxis.set_major_formatter(matplotlib.ticker.ScalarFormatter())
plt.title("Example graph")
plt.xlabel("time (yr)")
plt.ylabel("quantity a")
plt.plot(t,a,"r-", label = 'Example graph')
plt.legend(loc='best')
where a is a time-dependent variable. (This is just one graph of many.)
However, the graphs look a bit jagged rather than oscillatory, and I get this error:
OverflowError: Exceeded cell block limit (set 'agg.path.chunksize' rcparam)
I'm not sure what this error means. I've looked at other answers but don't know how to set 'agg.path.chunksize'.
Also, the integration + plotting takes around 7 hours and that is with some CPU processing hacks, so I really do not want to implement anything that would increase the time.
How can I overcome this error?
I have attempted to reduce the timestep, however I obtain this error instead:
Excess work done on this call (perhaps wrong Dfun type).
Run with full_output = 1 to get quantitative information.
As the error message suggests, you may set the chunksize to a larger value.
plt.rcParams['agg.path.chunksize'] = 1000
However, it is also worth asking why this error occurs in the first place. It only occurs if you are trying to plot an unreasonably large amount of data on the graph. If you try to plot 200000000 points, for example, the renderer might have trouble keeping them all in memory. But is it really necessary to plot so many points? A screen may display some 2000 points horizontally, a printed page maybe 6000. Generally speaking, using more points than that does not make sense.
If the solution of your differential equations requires a large point density, that does not automatically mean you need to plot all of the points.
E.g. one could just plot every 100th point,
plt.plot(x[::100], y[::100])
most probably without even affecting the visual plot appearance.
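Both suggestions together, as a rough sketch: the sine signal is a made-up stand-in for the odeint solution, and the target of about 5000 plotted points is an assumption based on the screen-width argument above. Note that the third argument of numpy.linspace is the number of samples and has to be an integer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Let the Agg renderer split long paths into chunks.
plt.rcParams["agg.path.chunksize"] = 10000

# Third argument of linspace: number of samples (an int), not a step size.
t = np.linspace(0.0, 8e9, 5_000_000)
# Hypothetical stand-in for the integrated quantity.
a = 100.0 + 50.0 * np.sin(2.0 * np.pi * t / 1e9)

# Keep roughly 5000 points for plotting; the full arrays stay untouched.
step = max(1, len(t) // 5000)
t_plot, a_plot = t[::step], a[::step]

fig, ax = plt.subplots()
ax.plot(t_plot, a_plot, "r-", label="Example graph")
ax.legend(loc="best")
```

Plain striding can hide or alias oscillations that are faster than the stride, so it is worth comparing one downsampled plot against a zoomed-in view of the full data before relying on it.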

Frequency distribution graph

Is there a way to draw a frequency distribution graph in Python or R?
In R, use a histogram, which shows frequency on the y-axis against some categorization on the x-axis, as in your example.
The hist() function will at the very least help you plot one vector (a set of values). See ?hist for brief documentation, and also search this site.
For how to plot two vectors side by side, similar to your posted example, there is an example at http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/ ; scroll down to "Histogram and density plots with multiple groups".
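For the Python side of the question, a minimal sketch with NumPy and matplotlib could look like this. The values are randomly generated placeholders and the choice of 20 bins is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 1000 values to be binned.
rng = np.random.default_rng(2)
values = rng.normal(50.0, 10.0, 1000)

# counts[i] is the number of values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=20)

fig, ax = plt.subplots()
# Draw one bar per bin, starting at the left edge of the bin.
ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge", edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("frequency")
```

ax.hist(values, bins=20) does the binning and the drawing in one call; the two-step version is shown because it also gives you the counts as an array.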
