Superimposing plots in seaborn causes x-axis to misalign - python

I am having an issue trying to superimpose plots with seaborn. I am able to generate the two plots separately as
fig, (ax1,ax2) = plt.subplots(ncols=2,figsize=(30, 7))
sns.lineplot(data=data1, y='MSE',x='pct_gc',ax=ax1)
sns.boxplot(x="pct_gc", y="MSE", data=data2,ax=ax2,width=0.4)
The output looks like this:
But the problem appears when I try to superimpose the two plots by assigning both to the same ax object:
fig, ax1 = plt.subplots(figsize=(30, 7))
sns.lineplot(data=data1, y='MSE',x='pct_gc',ax=ax1)
sns.boxplot(x="pct_gc", y="MSE", data=data2,ax=ax1,width=0.4)
I am not able to identify why the X axis in the lineplot changes when superimposing both plots (the X axis of both plots goes from 0 to 0.069).
My goal is for both plots to be superimposed, while keeping the same X axis range.

Seaborn's boxplot creates a categorical x-axis, with all boxes evenly spaced. Internally the x-axis positions are numbered 0, 1, 2, ..., but the tick labels show the values from 0 to 0.069.
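A quick way to confirm this is to inspect the tick positions that boxplot leaves on the axes. The sketch below uses a small made-up DataFrame (the values are placeholders, not the asker's data) and assumes the default categorical behaviour of sns.boxplot:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: three distinct x values that are unevenly spaced
toy = pd.DataFrame({
    "pct_gc": [0.0, 0.0, 0.01, 0.01, 0.069, 0.069],
    "MSE": [1.0, 2.0, 2.0, 3.0, 3.0, 5.0],
})

fig, ax = plt.subplots()
sns.boxplot(x="pct_gc", y="MSE", data=toy, ax=ax)

# The boxes sit at integer positions, regardless of the numeric x values
tick_positions = [float(t) for t in ax.get_xticks()]
print(tick_positions)
```

A line drawn on the same axes with x values between 0 and 0.069 would therefore be squeezed close to position 0, which matches the behaviour described in the question.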
To combine a line plot with a boxplot, matplotlib's boxplot can be called directly, so that positions and widths can be set explicitly. When patch_artist=True, a rectangle is created (instead of just lines), for which a facecolor can be given. manage_ticks=False prevents boxplot from changing the x ticks and their limits. Optionally, notch=True would accentuate the median a bit more, but depending on the data, the confidence interval might be too large and look weird.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
data1 = pd.DataFrame({'pct_gc': np.linspace(0, 0.069, 200), 'MSE': np.random.normal(0.02, 0.1, 200).cumsum()})
data1['pct_range'] = pd.cut(data1['pct_gc'], 10)
fig, ax1 = plt.subplots(ncols=1, figsize=(20, 7))
sns.lineplot(data=data1, y='MSE', x='pct_gc', ax=ax1)
for interval, color in zip(np.unique(data1['pct_range']), plt.cm.tab10.colors):
    ax1.boxplot(data1[data1['pct_range'] == interval]['MSE'],
                positions=[interval.mid], widths=0.4 * interval.length,
                patch_artist=True, boxprops={'facecolor': color},
                notch=False, medianprops={'color':'yellow', 'linewidth':2},
                manage_ticks=False)
plt.show()
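The effect of manage_ticks=False can be checked in isolation. The following is a minimal sketch with made-up random data (not the data from the question): it records the x limits after drawing a line, adds two boxes at explicit positions inside that range, and reads the limits again.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 0.069, 50)

fig, ax = plt.subplots()
ax.plot(x, rng.normal(size=50).cumsum())
xlim_before = ax.get_xlim()

# Two boxes placed at data coordinates on the same numeric x-axis
ax.boxplot([rng.normal(size=30), rng.normal(size=30)],
           positions=[0.02, 0.05], widths=0.005,
           manage_ticks=False)
xlim_after = ax.get_xlim()

print(xlim_before, xlim_after)
```

With manage_ticks left at its default, boxplot would instead install its own tick locations and adjust the limits around the box positions.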

Related

Overlaying Pandas plot with Matplotlib is sensitive to the plotting order

I have the following problem: I'm trying to overlay two plots: one Pandas plot via plot.area() for a dataframe, and a second plot that is a standard Matplotlib plot. Depending on the order of the code for those two, the Matplotlib plot is displayed only if its code comes before the Pandas plot.area() on the same axes.
Example: I have a Pandas dataframe called revenue that has a DatetimeIndex, and a single column with "revenue" values (float). Separately I have a dataset called projection with data along the same index (revenue.index).
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlaid properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain what I'm doing wrong and what I need to do to make the code more robust? Thanks in advance.
The values on the x-axis are different in the two plots. I think DataFrame.plot.area() formats the DatetimeIndex in its own way, which is not compatible with pyplot.plot().
If you plot the projection first, plot.area() can still plot the data and does not reformat the x-axis.
Mixing the two seems tricky to me, so I would use either pyplot or DataFrame.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)
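If the two APIs do have to be mixed, one option worth trying is the x_compat=True plotting option of pandas, which asks pandas to leave the dates in matplotlib's own date units. The sketch below is an assumption-laden illustration, not part of the original answer: it rebuilds the toy data with pd.date_range and compares the x limits of a default area plot with those of an x_compat area plot that also carries a pyplot line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure
import matplotlib.pyplot as plt
import pandas as pd

idx = pd.date_range("2021-12-01", periods=4, freq="MS")  # month starts
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]}, index=idx)
projection = [1000, 2000, 3000, 4000]

fig, (ax_default, ax_compat) = plt.subplots(1, 2, figsize=(10, 4))

# Default: pandas may convert the dates to its own period-based x values
revenue.plot.area(ax=ax_default)
lim_default = ax_default.get_xlim()

# x_compat=True keeps matplotlib date units, so ax.plot() lines up with the area
revenue.plot.area(ax=ax_compat, x_compat=True)
ax_compat.plot(revenue.index, projection, color="black", linewidth=3)
lim_compat = ax_compat.get_xlim()

print(lim_default, lim_compat)
```

Printing both limit tuples makes the mismatch visible: when the default plot uses period-based values, its limits are on a completely different numeric scale from the date numbers that pyplot uses.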

Categorical bubble plot in Python

I have a dataset with a lot of categorical variables and a binary target variable. What package is available in Python or other opensource GUI-based software where I can scatterplot two categorical variables on the X and Y axis and use the target variable as hue?
I have looked at Seaborn's catplot, but for that, one axis has to be numerical while the other categorical. So it doesn't serve this case.
For example, you can use the following:
import seaborn as sns
data = sns.load_dataset('titanic')
Here are the plot features I want
X-axis - 'embark_town'
Y-axis - 'class'
hue - 'alive'
I am of the opinion that if you have to rearrange a seaborn graph substantially, you can also create this graph from scratch with matplotlib. This gives us the opportunity to have a different approach to display this categorical vs categorical plot:
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
import numpy as np
#dataframe and categories
import seaborn as sns
df = sns.load_dataset('titanic')
X = "embark_town"
Y = "class"
H = "alive"
bin_dic = {0: "yes", 1: "no"}
#counting the X-Y-H category entries
plt_df = df.groupby([X, Y, H]).size().to_frame(name="vals").reset_index()
#figure preparation with grid and scaling
fig, ax = plt.subplots(figsize=(9, 6))
ax.set_ylim(plt_df[Y].unique().size-0.5, -0.5)
ax.set_xlim(-0.5, plt_df[X].unique().size+1.0)
ax.grid(ls="--")
#upscale factor for scatter marker size
scale=10000/plt_df.vals.max()
#left marker for category 0
ax.scatter(plt_df[plt_df[H]==bin_dic[0]][X],
           plt_df[plt_df[H]==bin_dic[0]][Y],
           s=plt_df[plt_df[H]==bin_dic[0]].vals*scale,
           c=[(0, 0, 1, 0.5)], edgecolor="black", marker=MarkerStyle("o", fillstyle="left"),
           label=bin_dic[0])
#right marker for category 1
ax.scatter(plt_df[plt_df[H]==bin_dic[1]][X],
           plt_df[plt_df[H]==bin_dic[1]][Y],
           s=plt_df[plt_df[H]==bin_dic[1]].vals*scale,
           c=[(1, 0, 0, 0.5)], edgecolor="black", marker=MarkerStyle("o", fillstyle="right"),
           label=bin_dic[1])
#legend entries for the two categories
l = ax.legend(title="Survived the catastrophe", ncol=2, framealpha=0, loc="upper right", columnspacing=0.1,labelspacing=1.5)
#legend_handles is the name in matplotlib 3.7 and later; older versions call it legendHandles
l.legend_handles[0]._sizes = l.legend_handles[1]._sizes = [800]
#legend entries representing sizes
bubbles_n=5
bubbles_min = 50*(1+plt_df.vals.min()//50)
bubbles_step = 10*((plt_df.vals.max()-bubbles_min)//(10*(bubbles_n-1)))
bubbles_x = plt_df[X].unique().size+0.5
for i, bubbles_y in enumerate(np.linspace(0.5, plt_df[Y].unique().size-1, bubbles_n)):
    #plot each legend bubble to indicate different marker sizes
    ax.scatter(bubbles_x,
               bubbles_y,
               s=(bubbles_min + i*bubbles_step) * scale,
               c=[(1, 0, 1, 0.6)], edgecolor="black")
    #and label it with a value
    ax.annotate(bubbles_min+i*bubbles_step, xy=(bubbles_x, bubbles_y),
                ha="center", va="center",
                fontsize="large", fontweight="bold", color="white")
plt.show()
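The half-circle trick used above can be tried on its own. This is a minimal sketch with a single made-up point and arbitrary sizes: two scatter calls at the same coordinates, one with a left-filled marker and one with a right-filled marker. Note that the s argument of scatter is the marker area in points squared, which is why the counts are multiplied by a scale factor above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle

left_half = MarkerStyle("o", fillstyle="left")
right_half = MarkerStyle("o", fillstyle="right")

fig, ax = plt.subplots()
# Same position, different sizes: each half can encode one category's count
ax.scatter([0], [0], s=4000, c=[(0, 0, 1, 0.5)], edgecolor="black", marker=left_half)
ax.scatter([0], [0], s=1500, c=[(1, 0, 0, 0.5)], edgecolor="black", marker=right_half)

fillstyles = (left_half.get_fillstyle(), right_half.get_fillstyle())
print(fillstyles)
```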
Seaborn, like matplotlib, supports plotting categorical vs categorical variables. One can create semitransparent markers that let you see both categories, although the two can be hard to tell apart when both markers are of similar size. The essential plot is rather easy: we transform the dataframe with groupby and size to count the entries per embarking town - class - alive triplet, then create a scatterplot with the count as marker size. The legend is the complicated part. Either the markers are tiny in the plot or massive in the legend. I tried to balance this, but I am not happy with the result. A lot of manual adjusting is necessary, so seaborn offers no real advantage here. Any suggestions on how to simplify this within seaborn are welcome.
import seaborn as sns
import matplotlib.pyplot as plt
#dataframe and categories
df = sns.load_dataset('titanic')
X = "embark_town"
Y = "class"
H = "alive"
#counting the X-Y-H category entries
plt_df = df.groupby([X, Y, H]).size().to_frame(name="people").reset_index()
#figure preparation with grid and scaling
fig, ax = plt.subplots(figsize=(6,4))
ax.set_ylim(plt_df[Y].unique().size-0.5, -0.5)
ax.set_xlim(-0.5, plt_df[X].unique().size+1.0)
ax.grid(ls="--")
#the actual scatterplot with markersize representing the counted values
sns.scatterplot(x=X,
                y=Y,
                size="people",
                sizes=(100, 10000),
                alpha=0.5,
                edgecolor="black",
                hue=H,
                data=plt_df,
                ax=ax)
#creating two legends because the hue markers differ in size from the others
handles, labels = ax.get_legend_handles_labels()
l = ax.legend(handles[:3], labels[:3], title="The poor die first", markerscale=2, loc="upper right")
ax.add_artist(l)
#and seaborn plots the size markers in black, so you would get massive black blobs in the legend
#we change the color and make them transparent
for handle in handles:
    handle.set_facecolors((0, 1, 1, 0.5))
ax.legend(handles[4::2], labels[4::2], title="N° of people", loc="lower right", handletextpad=4, labelspacing=3, markerfirst=False)
plt.tight_layout()
plt.show()
Sample output:
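The counting step shared by both versions (groupby, size, to_frame, reset_index) can be tried in isolation. The sketch below uses a tiny hand-made DataFrame as a stand-in for the titanic dataset, so the rows are assumptions made for illustration only:

```python
import pandas as pd

# Hand-made stand-in for the titanic data (five passengers)
toy = pd.DataFrame({
    "embark_town": ["Cherbourg", "Cherbourg", "Southampton", "Southampton", "Southampton"],
    "class": ["First", "First", "Third", "Third", "First"],
    "alive": ["yes", "no", "no", "no", "yes"],
})

# One row per observed (town, class, alive) combination, with its count
counts = (toy.groupby(["embark_town", "class", "alive"])
             .size()
             .to_frame(name="vals")
             .reset_index())
print(counts)
```

The resulting vals column is what gets scaled into marker sizes in the plots above.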

Using percentiles of a timeseries to set colour gradient in Python's matplotlib

I have a time series which will have over 10,000 daily values of a variable over the course of a year (array size (365, 10000)). Because I will have so much data (many time series for many variables), I was hoping to save only the percentiles (0, 10, 20,..., 90, 100) and use these later in plots to set a color gradient showing the density of values (obviously being darkest at the median and lightest at the min and max). The purpose of this is to avoid excessive file sizes in the saved simulation outputs, since I'll have millions of outputs to process. This would reduce the file sizes significantly if I can get it to work.
I was able to compute the percentiles of a sample data set (just using 50 values for now) and plot them as shown in the attached figure (using an array with size 365,11). How would I use this information to then set up a plot showing the colour gradient (or density of values)? Is this possible? Or is there some other way of going about it? I'm using matplotlib...
import numpy as np
import matplotlib.pyplot as plt
SampleData=(375-367)*np.random.random_sample((365, 50))+367
SDist=np.zeros((365,11))
for i in range(11):
    for t in range(365):
        SDist[t,i]=np.percentile(SampleData[t,:],i*10)
fig, (ax1) = plt.subplots(nrows=1, ncols=1, sharex=True, figsize=(8,4))
ax1.plot(np.arange(0,365,1), SDist)
ax1.set_title("SampleData", fontsize=15)
ax1.tick_params(labelsize=11.5)
ax1.set_xlabel('Day', fontsize=14)
ax1.set_ylabel('SampleData', fontsize=14)
fig.tight_layout()
EDIT
Here is a good example of what I'm going for (though obviously it will look different with my sample data) - I think it's similar to a fan chart:
You can use a matplotlib cm object to get a colormap and manually calculate the color to plot from a value. The example below calculates the color from the index of each percentile band. However, you can calculate the color based on anything, such as the number of observations used to calculate the percentile, so long as you plot the bands individually and pass each one the correct color value.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
n = 11 # change this value for the number of iterations/percentiles
colormap = cm.Blues # change this for the colormap of choice
percentiles = np.linspace(0,100,n)
SampleData=(375-367)*np.random.random_sample((365, 50))+367
SDist=np.zeros((365,n))
for i in range(n):
    for t in range(365):
        SDist[t,i]=np.percentile(SampleData[t,:],percentiles[i])
half = int((n-1)/2)
fig, (ax1) = plt.subplots(nrows=1, ncols=1, sharex=True, figsize=(8,4))
ax1.plot(np.arange(0,365,1), SDist[:,half],color='k')
for i in range(half):
    ax1.fill_between(np.arange(0,365,1), SDist[:,i],SDist[:,-(i+1)],color=colormap(i/half))
ax1.set_title("SampleData", fontsize=15)
ax1.tick_params(labelsize=11.5)
ax1.set_xlabel('Day', fontsize=14)
ax1.set_ylabel('SampleData', fontsize=14)
fig.tight_layout()
The result should look like this:
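To see what colormap(i/half) produces, the colors can be sampled without drawing anything. A small sketch, assuming the same cm.Blues colormap and half = 5 as in the code above:

```python
from matplotlib import cm

colormap = cm.Blues
half = 5

# Calling a colormap with a float in [0, 1] returns an RGBA tuple
band_colors = [colormap(i / half) for i in range(half)]
for i, rgba in enumerate(band_colors):
    print(i, tuple(round(c, 3) for c in rgba))
```

The outermost band (i = 0) gets the lightest color and is drawn first, so the darker inner bands drawn later sit on top of it.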
fill_between ended up solving the problem:
import numpy as np
import matplotlib.pyplot as plt
SampleData=(375-367)*np.random.random_sample((365, 50))+367
SDist=np.zeros((365,11))
for i in range(11):
    for t in range(365):
        SDist[t,i]=np.percentile(SampleData[t,:],i*10)
x=np.arange(0,365,1)
fig, (ax1) = plt.subplots(nrows=1, ncols=1, sharex=True, figsize=(8,4))
# set_color_cycle was removed from matplotlib; set_prop_cycle replaces it
ax1.set_prop_cycle(color=['red'])
ax1.plot(x, SDist[:,5])
for i in range(6):
    alph=0.05+(i/10.)
    ax1.fill_between(x, SDist[:,0+i], SDist[:,10-i], color="red", alpha=alph)
ax1.set_title("SampleData", fontsize=15)
ax1.tick_params(labelsize=11.5)
ax1.set_xlabel('Day', fontsize=14)
ax1.set_ylabel('SampleData', fontsize=14)
fig.tight_layout()
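Since the goal is to store only the percentiles for a very large number of series, the double loop can probably be replaced by a single vectorized call. The following sketch assumes the same (365, 50) sample array as above and checks that np.percentile with axis=1 gives the same (365, 11) table as the loop:

```python
import numpy as np

rng = np.random.default_rng(0)
SampleData = (375 - 367) * rng.random((365, 50)) + 367
percentiles = np.arange(0, 101, 10)  # 0, 10, ..., 100

# Loop version, as in the posts above
SDist_loop = np.zeros((365, 11))
for i in range(11):
    for t in range(365):
        SDist_loop[t, i] = np.percentile(SampleData[t, :], i * 10)

# Vectorized version: result has shape (11, 365), so transpose to (365, 11)
SDist = np.percentile(SampleData, percentiles, axis=1).T
print(SDist.shape)
```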

plot ellipse in a seaborn scatter plot

I have a data frame in pandas format (pd.DataFrame) with columns = [z1,z2,Digit], and I did a scatter plot in seaborn:
dataframe = dataFrame.apply(pd.to_numeric, errors='coerce')
sns.lmplot("z1", "z2", data=dataframe, hue='Digit', fit_reg=False, size=10)
plt.show()
What I want to do is plot an ellipse around each of these points. But I can't seem to plot an ellipse in the same figure.
I know the normal way to plot an ellipse is like:
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
elps = Ellipse((0, 0), 4, 2,edgecolor='b',facecolor='none')
a = plt.subplot(111, aspect='equal')
a.add_artist(elps)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.show()
But because I have to do "a = plt.subplot(111, aspect='equal')", the plot will be on a different figure. And I also can't do:
a = sns.lmplot("z1", "z2", data=rect, hue='Digit', fit_reg=False, size=10)
a.add_artist(elps)
because the 'a' returned by sns.lmplot() is a "seaborn.axisgrid.FacetGrid" object. Any solutions? Is there any way I can plot an ellipse without having to do something like a.add_artist()?
Seaborn's lmplot() uses a FacetGrid object to do the plot, and therefore your variable a = sns.lmplot(...) is a reference to that FacetGrid object.
To add your ellipse, you need a reference to the Axes object. The problem is that a FacetGrid can contain multiple axes depending on how you split your data. Thankfully there is a function FacetGrid.facet_axis(row_i, col_j) which returns a reference to a specific Axes object.
In your case, you would do:
# keyword x/y and height (the newer name for size) are what recent seaborn versions expect
a = sns.lmplot(x="z1", y="z2", data=rect, hue='Digit', fit_reg=False, height=10)
ax = a.facet_axis(0,0)
ax.add_artist(elps)
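For completeness, here is a self-contained sketch of the same idea. The DataFrame rect is a made-up stand-in for the asker's data, the ellipse parameters are arbitrary, and the keyword names (x, y, height) assume a reasonably recent seaborn version:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.patches import Ellipse

# Made-up stand-in for the asker's data
rect = pd.DataFrame({"z1": [0.0, 1.0, 2.0, 3.0],
                     "z2": [0.0, 1.5, 0.5, 2.0],
                     "Digit": [0, 0, 1, 1]})

g = sns.lmplot(x="z1", y="z2", data=rect, hue="Digit", fit_reg=False, height=4)
ax = g.facet_axis(0, 0)  # the single Axes inside the FacetGrid

# Ellipse centred at (1.5, 1.0), width 4 and height 2 in data units
elps = Ellipse((1.5, 1.0), 4, 2, edgecolor="b", facecolor="none")
ax.add_patch(elps)
```

add_patch is used here instead of add_artist so that the ellipse is registered in ax.patches; either call draws it on the same figure as the scatter plot.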

Remove whitespace from matplotlib heatplot

I have a heatplot in matplotlib for which I want to remove the whitespace to the north and east of the plot, as shown in the image below.
Here is the code I'm using to generate the plots:
# plotting
figsize=(50,20)
y,x = 1,2
fig, axarry = plt.subplots(y,x, figsize=figsize)
p = axarry[1].pcolormesh(copy_matrix.values)
# put the major ticks at the middle of each cell
axarry[1].set_xticks(np.arange(copy_matrix.shape[1])+0.5, minor=False)
axarry[1].set_yticks(np.arange(copy_matrix.shape[0])+0.5, minor=False)
axarry[1].set_title(file_name, fontweight='bold')
axarry[1].set_xticklabels(copy_matrix.columns, rotation=90)
axarry[1].set_yticklabels(copy_matrix.index)
fig.colorbar(p, ax=axarry[1])
Phylo.draw(tree, axes=axarry[0])
The easiest way to do this is to use ax.axis('tight').
By default, matplotlib tries to choose "even" numbers for the axes limits. If you want the plot to be scaled to the strict limits of your data, use ax.axis('tight'). ax.axis('image') is similar, but will also make the cells of your "heatmap" square.
For example:
import numpy as np
import matplotlib.pyplot as plt
# Note the non-"even" size... (not a multiple of 2, 5, or 10)
data = np.random.random((73, 78))
fig, axes = plt.subplots(ncols=3)
for ax, title in zip(axes, ['Default', 'axis("tight")', 'axis("image")']):
    ax.pcolormesh(data)
    ax.set(title=title)
axes[1].axis('tight')
axes[2].axis('image')
plt.show()
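The limits that axis('tight') produces can also be checked numerically. A minimal sketch, assuming the same 73 x 78 random array as above. Recent matplotlib releases may already give tight limits for pcolormesh by default, so the "Default" panel might not show any whitespace there.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure
import matplotlib.pyplot as plt
import numpy as np

data = np.random.random((73, 78))  # 73 rows, 78 columns

fig, ax = plt.subplots()
ax.pcolormesh(data)
ax.axis("tight")

# With tight limits the axes span exactly the mesh: x from 0 to 78, y from 0 to 73
xlim, ylim = ax.get_xlim(), ax.get_ylim()
print(xlim, ylim)
```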
