Seaborn: Violinplot experiences difficulty with too many variables?

I wanted to use seaborn to visualize my entire Pandas dataframe with violinplots, and I thought I had made the necessary adjustments to generate a large graph for the 270 variables my dataframe contains.
However, no matter what I do, the violinplots only display their inner mini-boxplots (as another question here describes) for each variable, and not their KDEs:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(50, 5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', dpi=220)  # figsize belongs to plt.subplots, not savefig
(apologies for the cropped graph, the whole thing is too big to post)
Whereas the following code, using the same pd.DataFrame but only showing the first six variables, displays correctly:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm.iloc[:,:6]), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', dpi=220)  # as above, figsize is not a savefig argument
How could I get a graph like the above for all the variables, filled with proper violinplots showing their kde's?

This is not related to the number of variables or the plot size but to the huge differences in the distributions of the variables. I can't access your data right now, so I will illustrate it with a made-up dataset. You can follow along with your own dataset by selecting the three variables with the most dispersion and the three with the least. As a dispersion measure you can use the variance, or even the data range (if you don't have crazy long tails), or something else; I am not sure what would work best.
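A quick way to pick those columns (a sketch; train_norm is the dataframe from the question, and the variable names are mine) is to rank the columns by variance:
most_disp = train_norm.var().nlargest(3).index    # 3 columns with the largest variance
least_disp = train_norm.var().nsmallest(3).index  # 3 columns with the smallest variance
subset = train_norm[most_disp.union(least_disp)]  # keep only those six columns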
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
rs = np.random.RandomState(42)
data = rs.randn(100, 6)
data[:, :3] *= 20
df = pd.DataFrame(data)
See what happens if we plot the density with common axes so they are directly comparable.
df.plot(kind='kde', subplots=True, layout=(3, 2), sharex=True, sharey=True)
plt.tight_layout()
This is more or less the same as what you can see in the seaborn violin plot, only transposed.
sns.violinplot(x='variable', y='value', data=pd.melt(df))
This is usually great for comparing the variables because you can read differences in width as differences in density. Unfortunately, the violins for the variables with more dispersion are so narrow that you can't see the width at all, and you lose any sense of the shape. On the other hand, the variables with less dispersion appear too short (in your dataset some of them are just horizontal lines).
For the first problem you can make the violins use all the available horizontal space with scale='width', but then you can no longer compare the density across variables. The width is the same at the peaks but the density is not.
sns.violinplot(x='variable', y='value', data=pd.melt(df), scale='width')
By the way, this is what matplotlib's violin plot does by default.
plt.violinplot(df.T)
For the second problem I think your only option is to normalize or standardize the variables in some way.
sns.violinplot(x='variable', y='value', data=pd.melt((df - df.mean()) / df.std()))
Now you have a clearer view of each variable separately (how many modes they have, how skewed they are, how long the tails are...) but you can compare neither the scale nor the dispersion across variables.
The moral of the story is that you can't see everything at once, you have to pick and choose depending on what you are looking for in the data.

Related

Python stacked barchart where y-axis scale is linear but the bar fill is logarithmic in the order of 10s

As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is logarithmic and grouped in orders of 10.
I made this plot before in RStudio with an in-house package; now I am trying to reproduce it with other programs (Python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce the following plot, which I made in RStudio:
This plot has the dataset divided into groups based on frequency, with the top 10 most frequent clones in red, followed by the next top 100, the next 1000, and so on. The y-axis has a 0.00-1.00 scale, though a 0-100% scale would mean the same thing in this context.
This is just to visualize whether I have big clones (the top 10) and how much of the overall dataset they occupy in frequency; i.e. the bigger the red stack, the larger the clones I have, signifying a significant clonal expansion in my sample of a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
(MYDATAFRAME.groupby(['Sample', 'cloneFraction']).size()
    .groupby(level=0).apply(lambda x: 100 * x / x.sum())
    .unstack().plot(kind='bar', stacked=True, legend=None))
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I can't figure out how to group the data entries by frequency as I mentioned above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
This was just to see if I could separate them correctly, but it does not achieve what I would like (besides being a pie chart, which I changed in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't get that working either: I get errors when I add it to the plotting code above, and if I use it without the other fancy stuff (% scale, etc.) the stacked barplot gets confused and plots the top 10 across all 4 samples in my data, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, and so on.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column holding the category each clone falls into, based on its value (i.e. whether it is among the top 10 most frequent, the next 100, etc.):
df['category'] = '10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'
This code starts from the biggest group (10001+) and works down to smaller and smaller groups, so rows that also qualify for a more frequent group are overwritten by the later, narrower assignments.
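A more compact equivalent (a sketch, assuming the same sample_ref and cloneCount columns as above, with pandas imported as pd) is to rank clones within each sample and bin the ranks:
rank = df.groupby('sample_ref')['cloneCount'].rank(method='first', ascending=False)
bins = [0, 10, 100, 1000, 10000, float('inf')]
labels = ['top10', '11-100', '101-1000', '1001-10000', '10001+']
df['category'] = pd.cut(rank, bins=bins, labels=labels)  # each clone lands in exactly one bin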
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!

Set a common reasonable scale to display heat map subplots with a single legend

I am using geoplot's kdeplot function to display Kernel Density maps (aka heatmaps), for different periods of time.
I have code that looks like this:
fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 10))
for ax, t in zip(axs.flatten(), periods):
    # df is a GeoPandas dataframe
    data = df.loc[(df['period'] == t), ['id', 'geometry']]
    # heatmap plot
    gplt.kdeplot(
        data,
        clip=africa.geometry,
        cmap='Reds',
        shade=True,
        cbar=True,
        ax=ax)
    gplt.polyplot(africa, ax=ax, zorder=1)
    ax.set_title(int(t))
It outputs the following image
I would like instead to define a common scale for my entire dataset (regardless of the period), which I can then use in kdeplot and as a single legend for my subplots.
I know that my data have different densities in different years, but I am trying to find common values that can be used for all of them.
I thought the levels parameter would be what I am looking for (i.e. using the same iso-proportions of the density for my periods, e.g. [0.2,0.4,0.6,0.8,1]).
However, when I use it in combination with cbar=True to display the legends, the values in each legend differ from those in the other legends (and from the levels vector).
Am I doing something wrong?
If not, do I need to manually set the cbar?
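One generic matplotlib pattern for a single shared legend (a sketch, not specific to geoplot; the vmin/vmax range and the 2x3 layout are assumptions) is to suppress the per-subplot colorbars and attach one figure-level colorbar built from a shared normalization:
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

norm = Normalize(vmin=0.0, vmax=1.0)  # assumed range; derive it from the densities of all periods combined
fig, axs = plt.subplots(2, 3, figsize=(20, 10))
for ax in axs.flat:
    # draw each period's heatmap here with cbar=False and the same cmap and levels
    pass
fig.colorbar(ScalarMappable(norm=norm, cmap='Reds'), ax=axs.ravel().tolist(), label='density')
plt.show()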

Python, Seaborn: Logarithmic Swarmplot has unexpected gaps in the swarm

Let's look at a swarmplot, made with Python 3.5 and Seaborn on some data (stored in a pandas dataframe df, with column labels kept in another class; this does not matter for now, just look at the plot):
ax = sns.swarmplot(x=self.dte.label_temperature, y=self.dte.label_current, hue=self.dte.label_voltage, data = df)
Now the data is more readable if plotted in log scale on the y-axis because it spans several decades.
So let's change the scaling to logarithmic:
ax.set_yscale("log")
ax.set_ylim(bottom = 5*10**-10)
Well, I have a problem with the gaps in the swarms. I guess they are there because the dot positions were computed with a linear axis in mind, so that the dots would not overlap there. But now they look kind of strange, and there is enough space to form 4 equal-looking swarms.
My question is: How can I force seaborn to recalculate the position of the dots to create better looking swarms?
mwaskom hinted to me in the comments how to solve this.
It is even stated in the swarmplot documentation:
Note that arranging the points properly requires an accurate transformation between data and point coordinates. This means that non-default axis limits should be set before drawing the swarm plot.
Set an existing axis to log scale and use it for the plot:
fig = plt.figure()           # create figure
rect = 0, 0, 1, 1            # create a rectangle for the new axis
log_ax = fig.add_axes(rect)  # create a new axis (or use an existing one)
log_ax.set_yscale("log")     # log first
sns.swarmplot(x=self.dte.label_temperature, y=self.dte.label_current, hue=self.dte.label_voltage, data=df, ax=log_ax)
This yields the correct and desired plotting behaviour:
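The same idea as a minimal self-contained sketch (synthetic data and made-up column names, just to show the order of operations):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
df = pd.DataFrame({'temperature': np.repeat([25, 50, 75], 100),
                   'current': 10.0 ** rng.uniform(-9, -3, 300)})  # spans several decades
fig, ax = plt.subplots()
ax.set_yscale('log')  # set the log scale BEFORE drawing the swarm
sns.swarmplot(x='temperature', y='current', data=df, ax=ax)
plt.show()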

Is it possible to use a non-gaussian kernel for the two lateral distributions in seaborn jointplot?

My data look like:
s1 = sns.jointplot(data.columns[i],
                   data.columns[j],
                   data=data,
                   space=0, color="b", stat_func=None)
If I use kde instead:
s1 = sns.jointplot(data.columns[i],
                   data.columns[j],
                   data=data, kind='kde',
                   space=0, color="b", stat_func=None)
I am quite happy with the two-dimensional kde interpolation, less so with the marginal ones. The problem is that, placed so close together, the two actually suggest the maximum of the distribution lies at two different points, which might be quite misleading.
So now the actual question: is it possible to specify something different from gaussian as a kernel (blue) for the two marginal distributions? (I know that gaussian is the only option in 2D.) For example, 'biw' (green) might look better aesthetically (though I am still not convinced that, mathematically speaking, it is a good idea to place interpolations done with different kernels side by side, making them seem like the same thing...). So my question is whether I can specify a different kernel somewhere in sns.jointplot, or whether the only way is to overwrite the marginal distribution with another one computed afterwards:
ax1 = sns.distplot(data[data.columns[j]])
sns.kdeplot(data[data.columns[j]], kernel= 'biw', ax = ax1)
You can set a different kernel for the marginal plots:
s1 = sns.jointplot(data.columns[i],
                   data.columns[j],
                   data=data, kind='kde',
                   space=0, color="b", stat_func=None,
                   marginal_kws={"kernel": "biw"})  # like this
or, if you want to change just one marginal plot, you can replot on them:
s1.ax_marg_y.cla()                    # clear the existing marginal axis
sns.kdeplot(data.y, ax=s1.ax_marg_y,  # choose the ax
            kernel="biw",             # choose your kernel
            legend=0,                 # remove the legend
            vertical=True)            # swap the axes
vertical=True switches the x and y axes; it is not needed if you replace the top marginal plot instead.

Displaying 3 histograms on 1 axis in a legible way - matplotlib

I have produced 3 sets of data which are organised in numpy arrays. I'm interested in plotting the probability distribution of these three sets of data as normed histograms. All three distributions should look almost identical so it seems sensible to plot all three on the same axis for ease of comparison.
By default matplotlib histograms are plotted as bars, which makes the image I want look very messy. Hence, my question: is it possible to force pyplot.hist to draw only a box/circle/triangle where the top of each bar would be, so I can cleanly display all three distributions on the same graph, or do I have to calculate the histogram data and then plot it separately as a scatter graph?
Thanks in advance.
There are two ways to plot three histograms simultaneously, though neither is exactly what you've asked for. To do what you ask, you must calculate the histogram yourself, e.g. with numpy.histogram, and then draw it with the plot method. Use scatter only if you want to associate other information with your points by setting a size for each point.
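For reference, here is a minimal sketch of that numpy.histogram-plus-plot approach on synthetic data:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=1000)
counts, edges = np.histogram(data, bins=20, range=(-5, 5), density=True)
centers = (edges[:-1] + edges[1:]) / 2  # bin midpoints
plt.plot(centers, counts, 'o')          # one marker where the top of each bar would be
plt.show()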
The first alternative approach to using hist involves passing all three data sets at once to the hist method. The hist method then adjusts the widths and placements of each bar so that all three sets are clearly presented.
The second alternative is to use the histtype='step' option, which makes clear plots for each set.
Here is a script demonstrating this:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(101)
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
c = np.random.normal(size=1000)
common_params = dict(bins=20,
                     range=(-5, 5),
                     density=True)  # 'normed=True' in older matplotlib versions
plt.subplots_adjust(hspace=.4)
plt.subplot(311)
plt.title('Default')
plt.hist(a, **common_params)
plt.hist(b, **common_params)
plt.hist(c, **common_params)
plt.subplot(312)
plt.title('Skinny shift - 3 at a time')
plt.hist((a, b, c), **common_params)
plt.subplot(313)
common_params['histtype'] = 'step'
plt.title('With steps')
plt.hist(a, **common_params)
plt.hist(b, **common_params)
plt.hist(c, **common_params)
plt.savefig('3hist.png')
plt.show()
And here is the resulting plot:
Keep in mind you could do all this with the object oriented interface as well, e.g. make individual subplots, etc.
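For instance, the third panel in object-oriented style (a sketch reusing a, b and c from the script above):
fig, ax = plt.subplots()
ax.set_title('With steps')
for data in (a, b, c):
    ax.hist(data, histtype='step', bins=20, range=(-5, 5), density=True)
fig.savefig('3hist_oo.png')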
