Seaborn distplot: y axis problems with multiple kdeplots - python

I am currently plotting 3 kernel density estimates together on the same graph. I assumed that kdeplot uses relative frequency as the y value; however, for some of my data the kdeplot shows y values well above 1.
The code I'm using:
sns.distplot(data1, kde_kws={"color": "b", "lw": 1.5, "shade": False, "kernel": "gau", "label": "t"}, hist=False)
Does anyone know how I can either make the kdeplot use relative frequency for the y value, or have the y-axis limit adjust automatically to the maximum density calculated?

Okay, so I figured out that I just needed to set the autoscaling to tight; that way it didn't give negative values on the scale.
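A minimal sketch of that fix, assuming seaborn's kdeplot (the data here is made up):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data1 = np.random.normal(0, 0.1, 500)  # narrow spread, so the density peak is well above 1
ax = sns.kdeplot(data1, color="b", lw=1.5, label="t")
# tight autoscaling pins the y limits to the data range, so there is no
# padding below zero and the top follows the maximum estimated density
ax.autoscale(axis="y", tight=True)
plt.show()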

Related

Set axis limits in matplotlib but autoscale within them

Is it possible to set the max and min values of an axis in matplotlib, but then autoscale when the values are much smaller than these limits?
For example, I want a graph of percentage change to be limited between -100 and 100, but many of my plots will be between, say, -5 and 5. When I use ax.set_ylim(-100, 100), this graph is very unclear.
I suppose I could use something like ax.set_ylim(max((-100, data-n)), min((100, data+n))), but is there a more built-in way to achieve this?
If you want to drop extreme outliers, you could use numpy's quantile function to find, say, the 0.01 % and 99.99 % quantiles of the data.
near_min = np.quantile(in_data, 0.0001)
near_max = np.quantile(in_data, 0.9999)
ax.set_ylim(near_min, near_max)
You will need to adjust the quantiles depending on how much data you are willing to drop. You might also want to test whether the difference between near_min and the true minimum is significant.
As ImportanceOfBeingErnest pointed out, there is no built-in support for this feature. In the end I just used my original idea, but padded the limits by 10 % of the max and min to give the impression of autoscaling.
ax.set_ylim(max((-100, data_min+data_min*0.1)), min((100, data_max+data_max*0.1)))
where in my case it holds that data_min <= 0 and data_max >= 0.
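A minimal sketch of that idea as a reusable helper (the helper name is made up, and it pads by a fraction of the data span rather than of the values themselves, so it also behaves sensibly when both limits have the same sign):

def set_clamped_ylim(ax, data, hard_min=-100, hard_max=100, margin=0.1):
    # autoscale to the data with a small relative margin,
    # but never go beyond the hard limits
    data_min, data_max = min(data), max(data)
    pad = margin * (data_max - data_min)
    ax.set_ylim(max(hard_min, data_min - pad), min(hard_max, data_max + pad))

Calling set_clamped_ylim(ax, percent_change) after each update then keeps small series readable while still capping extreme ones at -100 and 100.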
Why not just set the axis limits based on the range of the data each time the plot is updated?
ax.set_ylim(min(data), max(data))
Or check whether the data lies within some threshold, and only then set the axis limits:
if max(abs(data)) < thresh:
    ax.set_ylim(min(data), max(data))

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above:
import numpy as np
from fastkde import fastKDE
import pylab as PP

# Generate a dataset of two random variables (200000 pairs of datapoints)
N = int(2e5)
var1 = 50 * np.random.normal(size=N) + 0.1
var2 = 0.01 * np.random.normal(size=N) - 300

# Do the self-consistent density estimate
myPDF, axes = fastKDE.pdf(var1, var2)

# Extract the axes from the axis list
v1, v2 = axes

# Plot contours of the PDF; it should be a set of concentric ellipsoids centered
# on (0.1, -300). Comparatively, the y-axis range should be tiny and the x-axis
# range should be large.
PP.contour(v1, v2, myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for, say, the point (0, 300), without having to include it in var1 and var2? I don't want the KDE to be calculated with this data point; I just want to know the KDE at that point.
I guess what I really want is to be able to give fastKDE a histogram of the data, so that I can set its axes myself. I just don't know whether this is possible.
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I want to evaluate. I then look up the corresponding pdf value at that point (it should be small near the edge of the pdf grid, if not identically zero) and assign that value accordingly.
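A minimal sketch of that nearest-grid-point lookup, reusing myPDF, v1, and v2 from the 2D example above (everything else here is illustrative):

import numpy as np
from scipy.spatial import cKDTree

# build a tree over the (x, y) grid points that fastKDE returned
grid_x, grid_y = np.meshgrid(v1, v2)
tree = cKDTree(np.column_stack([grid_x.ravel(), grid_y.ravel()]))

def pdf_at(point):
    # find the closest grid node and return the pdf value stored there
    _, idx = tree.query(point)
    return myPDF.ravel()[idx]

print(pdf_at((0.0, 300.0)))  # far outside the data, so this should be essentially zero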
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you can calculate the step size in every grid dimension, then get the index of your query point by subtracting the start of the axis from the point value, dividing by the step size of that dimension, and rounding to an integer. For example, in 1D:
def fastkde_test(test_x, num_p):
    # num_p is the grid size passed to fastKDE; it was left undefined in the original snippet
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    x_step = (max(axes) - min(axes)) / (len(axes) - 1)  # spacing of the KDE grid
    x_ind = np.int32(np.round((test_x - min(axes)) / x_step))
    return kde[x_ind]
where test_x in this case is both the set used to define the KDE and the query set. Doing it this way was faster by roughly a factor of 10 in my case (at least in 1D; I haven't tested higher dimensions), and it does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the value at the corresponding border of your KDE grid. You would, however, have to handle this yourself by clipping the resulting x_ind to the valid index range, i.e. 0 to len(axes) - 1.
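That clipping could be a one-line addition just before the return in the function above:

x_ind = np.clip(x_ind, 0, len(axes) - 1)  # out-of-range queries snap to the nearest border of the grid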

Heatmap with varying y axis

I would like to create a visualization like the upper part of this image. Essentially, a heatmap where each point in time has a fixed number of components but these components are anchored to the y axis by means of labels (that I can supply) rather than by their first index in the heatmap's matrix.
I am aware of pcolormesh, but that does not seem to give me the y-axis functionality I seek.
Lastly, I am also open to solutions in R, although a Python option would be much preferable.
I am not completely sure if I understand your meaning correctly, but by looking at the picture you have linked, you might be best off with a roll-your-own solution.
First, you need to create an array with the heatmap values so that you have one row for each label and one column for each time slot. You fill the array with NaNs and then write whatever heatmap values you have to the correct positions.
Then you need to trick imshow a bit to scale and show the image in the correct way.
For example:
import numpy as np
import matplotlib.pyplot as plt

# create some masked data
a = np.cumsum(np.random.random((20, 200)), axis=0)
X, Y = np.meshgrid(np.arange(a.shape[1]), np.arange(a.shape[0]))
a[Y < 15 * np.sin(X / 50.)] = np.nan
a[Y > 10 + 15 * np.sin(X / 50.)] = np.nan

# draw the image along with some curves
plt.imshow(a, interpolation='nearest', origin='lower', extent=[-2, 2, 0, 3])
xd = np.linspace(-2, 2, 200)
yd = 1 + .1 * np.cumsum(np.random.random(200) - .5)
plt.plot(xd, yd, 'w', linewidth=3)
plt.plot(xd, yd, 'k', linewidth=1)
plt.axis('auto')  # 'normal' in older matplotlib versions
plt.show()
Gives the masked heatmap band with the white-and-black curve drawn on top.
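To anchor the rows to your own labels rather than to their row indices, one option (not part of the original answer, just a sketch) is to set the y tick positions and labels explicitly:

# assuming a has one row per label, as constructed above
labels = ["component {}".format(i) for i in range(a.shape[0])]  # replace with your own labels
plt.imshow(a, interpolation='nearest', origin='lower', aspect='auto')
plt.yticks(np.arange(a.shape[0]), labels)
plt.show()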

Using fill_between() with a Pandas Data Series

I have graphed (using matplotlib) a time series and its associated upper and lower confidence interval bounds (which I calculated in Stata). I used Pandas to read the stata.csv output file and so the series are of type pandas.core.series.Series.
Matplotlib allows me to graph these three series on the same plot, but I wish to shade between the upper and lower confidence bounds to generate a visual confidence interval. Unfortunately I get an error, and the shading doesn't work. I think this is to do with the fact that the functions between which I wish to fill are pandas.core.series.Series.
Another post on here suggests that passing my_series.values instead of my_series will fix this problem; however, I cannot get this to work. I'd really appreciate an example.
As long as you don't have NaN values in your data, you should be okay:
In [78]: x = Series(linspace(0, 2 * pi, 10000))
In [79]: y = sin(x)
In [80]: fill_between(x.values, y.min(), y.values, alpha=0.5)
Which yields the area between the curve and its minimum, shaded at 50 % opacity.
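For the original use case, with the series and its lower and upper confidence bounds all held as pandas Series, a self-contained sketch (the bounds here are stand-ins for the ones computed in Stata) would be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = pd.Series(np.linspace(0, 2 * np.pi, 200))
y = np.sin(x)
lower = y - 0.2  # stand-in for the lower confidence bound
upper = y + 0.2  # stand-in for the upper confidence bound

plt.plot(x.values, y.values)
# passing .values (plain numpy arrays) avoids the trouble fill_between
# can have with pandas Series in older versions
plt.fill_between(x.values, lower.values, upper.values, alpha=0.5)
plt.show()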

Scale a matplotlib plot so that small/large positive/negative differences can be shown

The plot is supposed to show differences in time, which can be both negative and positive. Some differences are very small, while others are very large.
Can I scale the x-axis so that the resolution is very fine near x = 0 and coarse farther away from x = 0? Is it possible to have a logarithmic scale going outward from x = 0?
EDIT:
As suggested by @Evert, this solves the problem for me:
ax = gca()
...
ax.set_xscale("symlog")
and produces the plot I was after.
You can use the symlog setting in xscale(): http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.xscale
It scales logarithmically (also on the negative side), apart from a limited section around zero (which can be set using further keywords, see the documentation): that section is scaled linearly, thus avoiding all log(0) problems.
See here for an example.
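A minimal sketch of the symlog approach (the data and the width of the linear region are made up; the keyword is linthresh in current matplotlib, linthreshx in older versions):

import numpy as np
import matplotlib.pyplot as plt

# time differences spanning several orders of magnitude on both sides of zero
diffs = np.concatenate([-10.0 ** np.arange(4), [0.0], 10.0 ** np.arange(4)])

fig, ax = plt.subplots()
ax.plot(diffs, np.arange(len(diffs)), 'o')
# logarithmic away from zero, linear within +/- 1 around zero
ax.set_xscale('symlog', linthresh=1)
plt.show()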
I would make two subplots: plot the positive times in the right-hand subplot, and plot abs(negative times) in the left-hand subplot, with a reversed x-axis.
Is it possible to have a logarithmic scale going outward from x = 0?
No, because a logarithmic plot doesn't show zero --- as you approach the "left end" of the log-x axis, you go to negative infinity in log space, so you can't cross zero to get to the truly negative values. You have to cut zero out somehow.
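A sketch of that two-subplot alternative (the data and names here are illustrative):

import numpy as np
import matplotlib.pyplot as plt

diffs = np.array([-250.0, -3.0, -0.5, 0.8, 4.0, 120.0])
neg = -diffs[diffs < 0]  # magnitudes of the negative differences
pos = diffs[diffs > 0]

fig, (ax_left, ax_right) = plt.subplots(1, 2)
# left panel: abs(negative times) on a log axis, reversed so zero sits in the middle
ax_left.plot(neg, np.arange(len(neg)), 'o')
ax_left.set_xscale('log')
ax_left.invert_xaxis()
# right panel: positive times on an ordinary log axis
ax_right.plot(pos, np.arange(len(pos)), 'o')
ax_right.set_xscale('log')
plt.show()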
