seaborn tsplot with non-connected confidence intervals - python

I'm using seaborn's tsplot function to plot how well my model fit matches actual data in a time series, with CIs showing my predictions' standard deviations. My question is: Is there a way for tsplot not to fill in CIs between points? That is, for it to show the CIs of each point individually without connecting one CI to the next.
For the means this is accomplished by setting "interpolate" to False. I'm looking for the same thing, but for the CIs.
To illustrate, my plots currently look like this:
I'm fine with how this looks for means (red dots) that are close together, but the CI-transition looks rather odd when one mean is close to 1 and the next is close to 0. The data just happens to be like this. I'd be happy to turn the CI "connection" off, but would also be happy for any related aesthetic suggestions. Thank you.
For completeness' sake, the relevant offending code fragment is as follows:
import seaborn as sns; sns.set(color_codes=True)
import matplotlib.pyplot as plt
model_fit = ...  # fit data (elided)
actual_data = ...  # actual data (elided)
sns.tsplot(model_fit, interpolate=False, ci='sd', color='indianred', condition='predicted')
plt.plot(X, actual_data, linestyle='None', marker='*', label='actual')
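One way to get a separate interval per point is to skip the band entirely and draw error bars. tsplot was deprecated in seaborn 0.9 in favor of lineplot, so the sketch below uses plain matplotlib instead. It assumes model_fit is a 2D array with one row per fit and one column per time point; the simulated values are hypothetical stand-ins, not the asker's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model_fit: 20 fits (rows) at 8 time points (columns),
# with means that jump between values near 1 and values near 0.
true_means = np.array([0.9, 0.95, 0.1, 0.05, 0.9, 0.1, 0.85, 0.15])
model_fit = true_means + rng.normal(scale=0.05, size=(20, true_means.size))

x = np.arange(model_fit.shape[1])
mean = model_fit.mean(axis=0)
sd = model_fit.std(axis=0, ddof=1)  # sample standard deviation per time point

fig, ax = plt.subplots()
# fmt="o" draws markers only, so neither the means nor the intervals are connected.
container = ax.errorbar(x, mean, yerr=sd, fmt="o", color="indianred",
                        capsize=3, label="predicted")
ax.legend()
```

Within seaborn itself, the versions that shipped tsplot accepted err_style="ci_bars", and lineplot accepts err_style="bars"; either should give a similar per-point look, though neither was tried against the data in the question.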

Related

marker style by third variable

Might seem like a repeat question, but the solution in this post doesn't seem to work for me.
I have a bunch of data I want to plot as lines/curves, and another dataset linked to the curves consisting of XYZ data, where Z represents a labeling variable for the curves.
I've got some example code here with some XY data, and labels for anyone wanting to replicate what I'm doing:
plt.plot(xdata, ydata)
plt.scatter(xlab, ylab, c=lab) # needs a marker function adding
plt.show()
Ideally I want to add some kind of unique marker based on the label values: 0.1, 0.5, 1, 2, 3, 4, 6, 8, 10, 20. The labels are the same for each curve.
I have over 100 curves to plot, so something quick and effective is needed. Any help would be great!
My current solution would be to just split the data by labelling values, and then plot separately for each one (long and messy in my opinion). Figured someone might have a more elegant solution here.
I'm guessing you could do this with a dictionary... but I might need some help doing that!
Cheers, KB
Matplotlib's scatter does not accept a different marker for each point within a single call.
However, a less verbose and more robust solution for large datasets is to use the pandas and seaborn libraries and map the label column to the style parameter of seaborn's scatterplot.
Additionally, you can use the pandas.cut function to bin a continuous third variable first (it's something I regularly need when I want to use a third continuous value as a parameter in a graph). The seaborn approach looks like this:
import pandas as pd
import seaborn as sns
url = 'https://pastebin.com/raw/dwGBLqSb' # url of paste
df = pd.read_csv(url)
sns.scatterplot(data = df, x='labx', y='laby', style='lab')
and it produces the following example:
If you need more advanced labelling, you could also look at scikit-learn's LabelEncoder.
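Returning to the pandas.cut suggestion above, here is a minimal sketch of binning a continuous label column before plotting. The bin edges and bin names are hypothetical choices for illustration; only the label values come from the question.

```python
import pandas as pd

# Label values taken from the question; the bins below are an arbitrary example.
df = pd.DataFrame({"lab": [0.1, 0.5, 1, 2, 3, 4, 6, 8, 10, 20]})

# Right-closed intervals by default: (0, 1], (1, 5], (5, 20]
df["lab_bin"] = pd.cut(df["lab"], bins=[0, 1, 5, 20], labels=["low", "mid", "high"])
print(df)
```

The resulting lab_bin column could then be passed as style= (or hue=) to sns.scatterplot in place of the raw label column.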
Hopefully I've edited this answer enough not to run afoul of the rule against posting identical answers to multiple questions. For what it's worth, I am not affiliated with the seaborn library in any way, nor am I trying to promote anything. I am only trying to help someone with a problem similar to one I came across, for which I couldn't easily find a clear answer on SE.
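For readers who prefer to stay in matplotlib, the dictionary idea from the question can be sketched as below: one scatter call per label value, with a dictionary mapping each label to a marker. The xlab, ylab, and lab arrays are randomly generated stand-ins (the real data is not shown), and the choice of markers is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Label values from the question; the point data is a hypothetical stand-in.
labels = np.array([0.1, 0.5, 1, 2, 3, 4, 6, 8, 10, 20])
rng = np.random.default_rng(0)
xlab = rng.uniform(0, 10, size=50)
ylab = rng.uniform(0, 10, size=50)
lab = rng.choice(labels, size=50)

# One marker per label value.
markers = ["o", "s", "^", "v", "<", ">", "D", "P", "X", "*"]
marker_for = dict(zip(labels, markers))

fig, ax = plt.subplots()
for value, marker in marker_for.items():
    mask = lab == value  # exact match is safe here because lab was drawn from labels
    ax.scatter(xlab[mask], ylab[mask], marker=marker, label=f"{value:g}")
ax.legend(title="lab")
```

This is ten scatter calls regardless of how many curves there are, so it should stay manageable even with 100+ curves as long as all labelled points are gathered into the same three arrays.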

Set confidence intervals in seaborn 2D kdeplot #2

I plot a 2D KDE with seaborn with:
ax = sns.kdeplot(scatter_all["s_zscore"], scatter_all["p_zscore"])
I want the levels of the density estimate to be meaningful, i.e. I want them to mark confidence intervals. Basically I would like to obtain something very close to
this answer, but the data are not normalized and have to stay that way.
Could someone please explain where, how, and why I should change the calculation of the levels? I am looking for a clear statistical explanation, as I said in my comment below.
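One common reading of "confidence levels" for a 2D density is the highest-density region: the contour at density value t encloses probability p when the cells with density at least t sum to a fraction p of the total mass. The sketch below computes such thresholds for any density evaluated on a regular grid and does not require normalized data. The function name mass_thresholds and the analytic Gaussian used as a test density are assumptions for illustration; in practice the grid values would come from evaluating the KDE.

```python
import numpy as np

def mass_thresholds(density, probs):
    """For each p in probs, return the density value t such that the
    grid cells with density >= t hold (approximately) a fraction p of the mass."""
    flat = np.sort(density.ravel())[::-1]   # highest density first
    cum = np.cumsum(flat) / flat.sum()       # fraction of mass enclosed so far
    idx = np.searchsorted(cum, probs)        # first cell where enclosed mass >= p
    return flat[np.clip(idx, 0, flat.size - 1)]

# Hypothetical, deliberately non-normalized example: mean (10, 0), sd (5, 0.5).
x = np.linspace(-15.0, 35.0, 401)
y = np.linspace(-2.5, 2.5, 401)
xx, yy = np.meshgrid(x, y)
density = np.exp(-0.5 * (((xx - 10.0) / 5.0) ** 2 + (yy / 0.5) ** 2))
density /= 2.0 * np.pi * 5.0 * 0.5

probs = np.array([0.5, 0.68, 0.95])
thresholds = mass_thresholds(density, probs)
for p, t in zip(probs, thresholds):
    enclosed = density[density >= t].sum() / density.sum()
    print(f"p={p:.2f}  threshold={t:.5f}  enclosed mass={enclosed:.3f}")
```

With matplotlib, the thresholds can be passed as contour levels in increasing order, e.g. ax.contour(xx, yy, density, levels=np.sort(thresholds)). Newer seaborn releases (0.11 and later) also document the levels argument of kdeplot as iso-proportions of the density, which expresses the same idea directly. Note the sketch assumes the grid covers essentially all of the probability mass.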

Rotating parallel coordinate axis-names in Pandas

When using some of the built-in visualization tools in Pandas, one that is very helpful for me is the parallel_coordinates visualization. However, since I have around 18 features in the dataframe, the bottom of the parallel_coordinates plot gets really messy.
Therefore, I was wondering if anyone knew how to rotate the axis-names to be vertical rather than horizontal as shown here:
I did find a way to use parallel_coordinates in a polar setup, creating a radar chart. While that helped make the different features visible, that solution doesn't quite work, since whenever the values are close to 0 it becomes almost impossible to see the curve. Furthermore, doing it in the polar coordinate frame required me to stop using pandas' dataframe, which is part of what made this method so appealing.
Using plt.xticks(rotation=90) should be enough. Here is an example with the “Iris” dataset:
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
data = pd.read_csv('iris.csv')
parallel_coordinates(data, 'Name')
plt.xticks(rotation=90)
plt.show()
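For readers without a local iris.csv, here is a self-contained variant of the same idea. The DataFrame is randomly generated with 18 hypothetical feature names, and the rotation is applied through the Axes object rather than plt.xticks, which has the same effect on the tick labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)

# Hypothetical data: 18 features with long names plus a class column.
features = [f"feature_with_long_name_{i}" for i in range(18)]
df = pd.DataFrame(rng.normal(size=(30, 18)), columns=features)
df["Name"] = rng.choice(["a", "b", "c"], size=30)

fig, ax = plt.subplots(figsize=(10, 5))
parallel_coordinates(df, "Name", ax=ax)

# Rotate the axis names so they no longer overlap.
plt.setp(ax.get_xticklabels(), rotation=90)
fig.tight_layout()  # make room for the now-vertical labels
```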

How to better fit seaborn violinplots?

The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?
As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.
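To see why the violin extends below zero and what bw and cut do to it, here is a small NumPy-only sketch of a Gaussian KDE that follows the description above: bandwidth equal to bw times the sample standard deviation, and a drawn range that extends cut bandwidths past the extremes. The helper names are hypothetical, and this is not seaborn's implementation (for instance, whether seaborn uses ddof=1 for the standard deviation is an assumption here).

```python
import numpy as np

def kde_bandwidth(samples, bw):
    """Bandwidth as described above: scale factor times the sample std."""
    return bw * samples.std(ddof=1)

def gaussian_kde_1d(samples, grid, bw):
    """Evaluate a Gaussian KDE of `samples` at the points in `grid`."""
    h = kde_bandwidth(samples, bw)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

def drawn_range(samples, bw, cut):
    """Range over which the density is drawn: `cut` bandwidths past the extremes."""
    h = kde_bandwidth(samples, bw)
    return samples.min() - cut * h, samples.max() + cut * h

rng = np.random.default_rng(0)
data = rng.random(100)  # uniform on [0, 1), so never negative

# The kernel around each point has infinite support, so the estimate is
# positive even where no data exists.
density_below_zero = gaussian_kde_1d(data, np.array([-0.05]), bw=0.4)[0]
print(density_below_zero > 0)

print(drawn_range(data, bw=0.4, cut=2))  # extends past the data on both sides
print(drawn_range(data, bw=0.4, cut=0))  # stops exactly at the data extremes
```

A smaller bw shrinks both the smoothing and the overshoot, while cut=0 only hides the tails without changing the fitted density.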

Python, matplotlib: how to set tick label values to their logarithmic values

I have some data that I plot on a semi-log plot (log-lin style, with a logarithmic scale on the y-axis). Is there a way to change the y-axis tick labels from their actual values to their logarithmic values?
As an example, consider the following code:
import matplotlib.pyplot as plt
import numpy as np
x=np.array([1,2,3,4,5])
def f(x):
    return 10**(x-1)
plt.plot(x,f(x))
plt.yscale(u'log')
plt.show()
Which produces the following plot:
(Sorry it is kind of big, I do not know how to make it smaller, feel free to edit to help out with that).
In this plot the tick labels are shown as 10^0, 10^1, 10^2, etc.; however I would like them to display as their logarithmic values: 0, 1, 2, etc.
I realize I could go back and change plt.plot(x,f(x)) to plt.plot(x,np.log10(f(x))) and then make the y-axis linear again instead of logarithmic, but I want to know if there is any way matplotlib can just change the y-axis tick values themselves without me having to put np.log10() in all my plt.plot() calls. My reason for this is two-fold: I have many plt.plot() lines in my code and would rather not go back and change all of them, and I would then lose the logarithmically spaced minor ticks (although I'm sure there's some way to get them back even with a linear axis).
EDIT: I am aware of this question which has some similarities to mine but is not the same. The person in that question wants to change the tick labels from scientific form to "normal" decimal form. I want to change my tick labels from scientific form to the logarithmic (base 10) value of the number. I am sure the answer will be similar to the one I linked but it is not obvious to me how to do it. In fact, I looked at that question before posting mine but still decided to post mine because I did not know how to apply it to my problem. Perhaps to experienced programmers it is obvious how to apply the methods of the question I linked to my situation but it isn't obvious to me so please step me through it.
If you could show me a code sample (by copying my code sample and putting in the necessary lines) how this works I would much appreciate it.
You can use a custom formatter, for example:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import math
x=np.array([1,2,3,4,5])
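def f(x):
    return 10**(x-1)
plt.plot(x,f(x))
plt.yscale(u'log')
# Set custom tick formatting: label each major tick with its base-10 logarithm.
# math.log10 with the 'g' format avoids labels such as 2.9999999999999996.
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, pos: '{:g}'.format(math.log10(x))))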
plt.show()
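A small, checkable variant of the formatter idea, as a sketch: math.log(value, 10) can return values such as 2.9999999999999996 for 1000 because of floating-point rounding, so this version uses np.log10 together with the g format, which prints whole exponents as 0, 1, 2 and so on. The function name log10_label is a hypothetical choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np

def log10_label(value, pos):
    """Tick label showing the base-10 logarithm of a positive tick value."""
    return f"{np.log10(value):g}"

x = np.array([1, 2, 3, 4, 5])

fig, ax = plt.subplots()
ax.plot(x, 10.0 ** (x - 1))
ax.set_yscale("log")  # the axis stays logarithmic; only the labels change
ax.yaxis.set_major_formatter(FuncFormatter(log10_label))
```

Because only the formatter changes, the existing plt.plot() calls and the logarithmically spaced minor ticks are left as they are.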
