Python: Data Visualization of Outliers in a subplot

I use the following csv files: file1 and file2
to plot the following subplot:
The code to generate the subplot is the following:
import pandas as pd
import matplotlib.pyplot as plt

df = {}
df[1] = pd.read_csv('file1.csv')
df[2] = pd.read_csv('file2.csv')
# reg and int_col are not defined in this snippet
fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
for bet in [[1, 0], [2, 1]]:
    betas = reg[bet[0]]
    betas = betas.loc[int_col]  # .ix has been removed from pandas
    betas.dropna(inplace=True)
    betas.index = range(25)
    ax = betas.plot(ax=axes[bet[1]], grid=False, style=['b-', 'b--', 'b--'],
                    legend=None)
    ax.lines[0].set_linewidth(1.5)
    ax.lines[1].set_linewidth(0.6)
    ax.lines[2].set_linewidth(0.6)
    ax.axhline(y=0, color='k', linestyle='-', alpha=0.25, linewidth=0.5)
    ax.axvline(x=13, color='k', linestyle='-', alpha=0.25, linewidth=0.5)
    ax.set_xticks([0, 6, 13, 19, 24])
These plots show coefficients from a regression (solid blue lines) and their confidence intervals (dashed lines).
As you can see, both plots in the subplot have an outlier: the first point, at x=0.
The outliers are important, but they distort my graphs: the other points appear to lie on a straight line, when in fact there are important variations at x > 0.
What would be the proper data visualization to show the outlier and still get a better "zoom" on the other points at x > 0? Is a broken y-axis the best way? How can I do that in a subplot? Other suggestions?
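One possible way to get a broken y-axis inside a subplot grid is sketched below. It is not taken from the original post: the data are synthetic stand-ins for the regression output, and the y-limits (4.5 to 5.5 for the outlier, -0.4 to 0.4 for the rest) are arbitrary assumptions chosen for this example. Each panel is built from two stacked axes, a short one showing the outlier and a tall one zoomed on the remaining points.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the regression output (assumption): a large outlier
# at x=0 and small variations afterwards.
rng = np.random.default_rng(0)
x = np.arange(25)
coef = rng.normal(0.0, 0.05, size=25)
coef[0] = 5.0
lower, upper = coef - 0.1, coef + 0.1

# Two stacked axes per panel: a short one for the outlier, a tall one for the rest.
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(10, 5),
                         gridspec_kw={"height_ratios": [1, 3], "hspace": 0.05})

for col in range(2):
    ax_top, ax_bottom = axes[0, col], axes[1, col]
    for ax in (ax_top, ax_bottom):
        ax.plot(x, coef, "b-", linewidth=1.5)
        ax.plot(x, lower, "b--", linewidth=0.6)
        ax.plot(x, upper, "b--", linewidth=0.6)
        ax.axvline(x=13, color="k", alpha=0.25, linewidth=0.5)
    ax_top.set_ylim(4.5, 5.5)      # window around the outlier
    ax_bottom.set_ylim(-0.4, 0.4)  # zoom on the points at x > 0
    ax_bottom.axhline(y=0, color="k", alpha=0.25, linewidth=0.5)
    # Hide the facing spines so the two axes read as one broken axis.
    ax_top.spines["bottom"].set_visible(False)
    ax_bottom.spines["top"].set_visible(False)
    ax_top.tick_params(axis="x", bottom=False)
    ax_bottom.set_xticks([0, 6, 13, 19, 24])
```

Call plt.show() or fig.savefig(...) to view the result. Whether a broken axis is preferable to, say, an inset or a second zoomed figure depends on how much weight the outlier should carry for the reader.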

Related

Why does merging two bar chart subplots into one change the axis and how can I fix this?

I have two dataframes:
df1=pd.DataFrame(10*np.random.rand(4,3),index=[2,3,4,5],columns=["I","J","K"])
df2=pd.DataFrame(10*np.random.rand(4,3),index=[1,2,3,4],columns=["I","J","K"])
After creating a bar chart I get:
Now I want to merge them into one figure so I tried:
fig, ax = plt.subplots(sharex=True)
ax1 = df1.plot.bar(legend=True, rot=0, stacked=True, width=0.1, position=1, colormap="bwr", ax=ax, alpha=0.7)
ax2 = df2.plot.bar(legend=True, rot=0, stacked=True, width=0.1, position=0, colormap="BrBG", ax=ax, alpha=0.7)
plt.show()
But the result isn't what I would expect:
As you can see, I would want the x-axis to have values 1, 2, 3, 4, 5 and the graphs to correspond to their original index value. Where is the problem and how could I fix it?
Also if it was possible I would need to set the new axis values automatically since I have many of these dataframes all with different axis values and inserting the new axis values manually would take a long time. Maybe I could use .unique() in the index column and implement this somehow?
You can reindex both dataframes to the same index:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame(10 * np.random.rand(4, 3), index=[2, 3, 4, 5], columns=["I", "J", "K"])
df2 = pd.DataFrame(10 * np.random.rand(4, 3), index=[1, 2, 3, 4], columns=["I", "J", "K"])
fig, ax = plt.subplots()
combined_index = df1.index.union(df2.index)
df1.reindex(combined_index).plot.bar(legend=True, rot=0, stacked=True, width=0.1, position=1,
colormap="bwr", alpha=0.7, ax=ax)
df2.reindex(combined_index).plot.bar(legend=True, rot=0, stacked=True, width=0.1, position=0,
colormap="BrBG", alpha=0.7, ax=ax)
plt.show()
Pandas bar plot creates categorical x-ticks (internally numbered 0,1,2,...) using the dataframe's index. First the internal tick positions are assigned, and then the labels. By using the same index for both dataframes, the positions will coincide.
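As a small illustration of the reindexing step, the sketch below builds the combined index and shows that rows missing from a dataframe become NaN after reindex, so both dataframes end up with the same five categorical positions. The reduce-based union over a list called frames is a hypothetical extension for the "many dataframes" case raised in the question, not part of the original answer.

```python
from functools import reduce

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((4, 3)), index=[2, 3, 4, 5], columns=["I", "J", "K"])
df2 = pd.DataFrame(np.ones((4, 3)), index=[1, 2, 3, 4], columns=["I", "J", "K"])

# Union of the two indexes; Index.union returns the labels sorted.
combined_index = df1.index.union(df2.index)
print(list(combined_index))

# After reindexing, labels absent from a dataframe are filled with NaN,
# so they occupy a tick position but draw no bar.
df1_aligned = df1.reindex(combined_index)
df2_aligned = df2.reindex(combined_index)
print(df1_aligned.loc[1].isna().all())
print(df2_aligned.loc[5].isna().all())

# Hypothetical generalisation to any number of dataframes.
frames = [df1, df2]
all_index = reduce(lambda a, b: a.union(b), (d.index for d in frames))
```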

matplotlib subplots last plot disturbs log scale

I am making a matplotlib figure with a 2x2 dimension where x- and y-axis are shared, and then loop over the different axes to plot in them. I'm plotting variant data per sample, and it is possible that a sample doesn't have variant data, so then I want the plot to say "NA" in the middle of it.
import matplotlib.pyplot as plt
n_plots_per_fig = 4
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols, sharex="all", sharey="all", figsize=(8, 6))
axs = axs.ravel()
for i, ax in enumerate(axs):
    x = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]  # example values, but this list CAN be empty
    bins = 3  # example bins
    if x:
        ax.hist(x, bins=bins)  # plot the hist
        ax.set_yscale("log")
        ax.set_title(str(i), fontsize="medium")
    else:
        ax.set_title(str(i), fontsize="medium")
        ax.text(0.5, 0.5, 'NA', ha='center', va='center', transform=ax.transAxes)
fig.show()
This works in almost every case; example of wanted output:
However, only if the last plot in the figure doesn't have any data, then this disturbs the log scale. Example code that triggers this:
import matplotlib.pyplot as plt
n_plots_per_fig = 4
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols, sharex="all", sharey="all", figsize=(8, 6))
axs = axs.ravel()
for i, ax in enumerate(axs):
    x = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
    bins = 3
    if i == n_plots_per_fig-1:  # this will distort the log scale
        ax.set_title(str(i), fontsize="medium")
        ax.text(0.5, 0.5, 'NA', ha='center', va='center', transform=ax.transAxes)
    elif x:
        ax.hist(x, bins=bins)  # plot the hist
        ax.set_yscale("log")
        ax.set_title(str(i), fontsize="medium")
    else:
        ax.set_title(str(i), fontsize="medium")
        ax.text(0.5, 0.5, 'NA', ha='center', va='center', transform=ax.transAxes)
fig.show()
The log scale is now set to really low values, and this is not what I want. I've tried several things to fix this, like unsharing the y-axes for the plot that doesn't have any data [ax.get_shared_y_axes().remove(axis) for axis in axs] or hiding the plot ax.set_visible(False), but none of this works. The one thing that does work is removing the axes from the plot with ax.remove(), but since this is the bottom most sample, this also removes the values for the x ticks for that column:
And besides that, I would still like the name of the sample that didn't have any data to be visible in the axes (and the "NA" text), and removing the axes doesn't allow this.
Any ideas on a fix?
Edit: I simplified my example.
You can set the limits manually with ax.set_xlim() / ax.set_ylim().
Note that if the axes are shared, it does not matter on which subplot you call those functions. For example, with the flattened axs array from the question:
axs[-1].set_ylim(1e0, 1e2)
If you do not know the limits in advance, you can infer them from the other plots:
x = np.random.random(100)
bins = 10
if bins != 0:
    ...
    yy, xx = np.histogram(x, bins=bins)
    ylim = yy.min(), yy.max()
    xlim = xx.min(), xx.max()
else:
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
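A self-contained sketch of the same idea, under a few assumptions: the per-sample lists in samples are made-up example data, the last sample is deliberately empty, and the limits (0.5 as a positive lower bound for the log scale, twice the largest bin count as the upper bound) are arbitrary choices. The largest count is collected from the populated axes, and the scale and limits are then set once on any of the shared axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2, sharex="all", sharey="all", figsize=(8, 6))
axs = axs.ravel()

# Made-up example data; the last sample has no variant data.
samples = [[1, 2, 2, 3, 3, 3], [1, 1, 2, 3], [2, 3, 3, 3, 3], []]

max_count = 0
for i, (ax, x) in enumerate(zip(axs, samples)):
    ax.set_title(str(i), fontsize="medium")
    if x:
        counts, edges, _ = ax.hist(x, bins=3)
        max_count = max(max_count, counts.max())
    else:
        ax.text(0.5, 0.5, "NA", ha="center", va="center", transform=ax.transAxes)

# The y-axes are shared, so setting scale and limits on one axes applies to all.
axs[0].set_yscale("log")
axs[0].set_ylim(0.5, max_count * 2)
```

Setting the limits after the loop means the result does not depend on which subplot happens to be empty.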

How to index a Matplotlib subplot

I'm trying to plot two pie charts together. I have been reading the Matplotlib documentation (https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_demo2.html) and cannot see what I'm doing wrong. I'm getting an indexing error on line 13 (patches = axs[1,1].pie...).
The code worked until I started using axs[1,1] etc. and tried to add the subplots.
Code
import matplotlib.pyplot as plt
from matplotlib import rcParams
print('\n'*10)
# Make figure and axes
fig, axs = plt.subplots(1,2)
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Alpha', 'Beta', 'Gamma', 'Phi', 'Theta'
sizes = [3, 6, 2, 3, 10]
explode = (0, 0.1, 0, 0, 0) # only "explode" the 2nd slice (i.e. 'Hogs')
patches = axs[1,1].pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)[0]
#patches[2].set_hatch('\\\\') # Pie slice #0 hatched.
axs[1,1].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("My title", fontsize=14, fontweight='bold', size=16, y=1.02)
# Pie chart 2
labels = 'Alpha', 'Beta', 'Gamma', 'Phi', 'Theta'
sizes = [3, 6, 2, 3, 10]
explode = (0, 0.1, 0, 0, 0) # only "explode" the 2nd slice (i.e. 'Hogs')
patches = axs[1,2].pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)[0]
patches[2].set_hatch('\\\\') # Pie slice #0 hatched.
axs[1,2].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("My title", fontsize=14, fontweight='bold', size=16, y=1.02)
plt.show()
Traceback
Traceback (most recent call last):
File "/Users/.../Desktop/WORK/time_1.py", line 13, in <module>
patches = axs[1,1].pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Array axs is 1-dimensional, change axs[1,1] and axs[1,2] to axs[0] and axs[1], then your code will work.
From matplotlib documentation.
# using the variable ax for a single Axes
fig, ax = plt.subplots()
# using the variable axs for multiple Axes
fig, axs = plt.subplots(2, 2)
# using tuple unpacking for multiple Axes
fig, (ax1, ax2) = plt.subplots(1, 2)
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
So your axs is just a NumPy array of shape (2,).
Changing the index should do the trick.
Change axs[1,1] --> axs[0], axs[1,2] --> axs[1]
The problem is that Matplotlib squeezes the axs array into a 1D shape if there is only one row or only one column of subplots. Fortunately, this inconsistent behaviour can be disabled by passing the squeeze argument:
fig, axs = plt.subplots(1, 2, squeeze=False)
And then you can just normally index into it with axs[0,0] and axs[0,1], like you would if there were multiple rows of subplots.
I would recommend to always pass squeeze=False, so that the behaviour is the same regardless of how many rows there are and automated plotting scripts don't need to come up with special cases for single-row plots (or else risk cryptic errors if somebody later on wants to generate a plot that happens to have only a single row).
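The effect of squeeze can be checked directly by comparing the shapes of the returned arrays. This is a minimal sketch; the variable names axs_squeezed and axs_2d are chosen for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt

# Default behaviour: a single row is squeezed to a 1D array.
fig1, axs_squeezed = plt.subplots(1, 2)
print(axs_squeezed.shape)  # (2,)

# With squeeze=False the array is always 2D, indexed as [row, column].
fig2, axs_2d = plt.subplots(1, 2, squeeze=False)
print(axs_2d.shape)  # (1, 2)

left, right = axs_2d[0, 0], axs_2d[0, 1]
```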

How to set individual sub-figure size in seaborn plot?

I have a code that plots multiple plots - a heatmap and a barplot in a single plot in Seaborn. However, both plots are sized equally, i.e, one half of overall figure is heatmap, other half is barplot. Is there a way to control individual plot sizes such that heatmap occupies 75% of the plot size while barplot occupies only 25% of the plot?
Reference Code:
fig, ax = plt.subplots(1, 2, figsize=(7, 5))
heatmap_scores = np.random.uniform(0, 1, size=(12, 12))
sns.heatmap(heatmap_scores, linewidth=0.5, cmap="OrRd", ax=ax[0])
ax[0].set_xlabel('Head')
ax[0].set_ylabel('Layer')
ax[0].set_title('Attention Heatmap')
x = np.mean(heatmap_scores, axis=1)
y = np.arange(0, 12)
sns.barplot(x=x, y=y, ax=ax[1], orient='h', color='r', dodge=False)
ax[1].set_title('Layer Average')
ax[1].set(yticklabels=[])
plt.savefig('fig.png')
plt.close()
You can customize GridSpec options, passing them to subplots, with the key gridspec_kw:
fig, ax = plt.subplots(1, 2, figsize=(7, 5), gridspec_kw={'width_ratios': [.75, .25]})
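To see the ratio take effect, the sketch below builds the same 1x2 layout and compares the widths of the two axes. As an assumption made to keep the example free of extra dependencies, plain Matplotlib calls (pcolormesh and barh) stand in for sns.heatmap and sns.barplot; note that a seaborn heatmap with its default colorbar takes some space from its axes, so the visible ratio there can differ slightly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
heatmap_scores = rng.uniform(0, 1, size=(12, 12))

# width_ratios are relative weights: [3, 1] is equivalent to [.75, .25].
fig, ax = plt.subplots(1, 2, figsize=(7, 5),
                       gridspec_kw={"width_ratios": [3, 1]})

# Stand-ins for sns.heatmap and sns.barplot (assumption, see lead-in).
ax[0].pcolormesh(heatmap_scores, cmap="OrRd")
ax[0].set_title("Attention Heatmap")
ax[1].barh(np.arange(12) + 0.5, heatmap_scores.mean(axis=1), color="r")
ax[1].set_title("Layer Average")

# Compare the axes widths in figure coordinates.
w0 = ax[0].get_position().width
w1 = ax[1].get_position().width
print(round(w0 / w1, 2))
```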

Difficulty aligning xticks to edge of Histogram bin

I am trying to show the frequency of my data throughout the hours of the day, using a histogram, in 3 hour intervals. I therefore use 8 bins.
plt.style.use('seaborn-colorblind')
plt.figure(figsize=(10,5))
plt.hist(comments19['comment_hour'], bins = 8, alpha = 1, align='mid', edgecolor = 'white', label = '2019', density=True)
plt.title('2019 comments, 8 bins')
plt.xticks([0,3,6,9,12,15,18,21,24])
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
However, the ticks are not aligning with the bin edges, as seen from the image below.
You can do either:
plt.figure(figsize=(10,5))
# define the bin and pass to plt.hist
bins = [0,3,6,9,12,15,18,21,24]
plt.hist(comments19['comment_hour'], bins = bins, alpha = 1, align='mid',
         edgecolor = 'white', label = '2019', density=True)
# remove this line
# plt.xticks([0,3,6,9,12,15,18,21,24])
plt.title('2019 comments, 8 bins')
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
Or:
fig, ax = plt.subplots()
bins = np.arange(0,25,3)
comments19['comment_hour'].plot.hist(ax=ax,bins=bins)
# other plt format
If you set bins=8, Matplotlib (which uses NumPy's histogram) will set 9 evenly spaced boundaries, from the lowest value in the input array (0) to the highest (23), so at [0.0, 2.875, 5.75, 8.625, 11.5, 14.375, 17.25, 20.125, 23.0]. To get the 9 boundaries at 0, 3, 6, ... you need to set them explicitly.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
plt.style.use('seaborn-colorblind')
comments19 = pd.DataFrame({'comment_hour': np.random.randint(0, 24, 100)})
plt.figure(figsize=(10, 5))
plt.hist(comments19['comment_hour'], bins=np.arange(0, 25, 3), alpha=1, align='mid', edgecolor='white', label='2019',
density=True)
plt.title('2019 comments, 8 bins')
plt.xticks(np.arange(0, 25, 3))
plt.xlabel('Hours of Day')
plt.ylabel('Relative Frequency')
plt.tight_layout()
plt.legend()
plt.show()
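The bin boundaries described above can be reproduced without plotting. This sketch assumes hour values covering 0 to 23 and uses np.histogram_bin_edges to show the edges that an integer bins=8 produces, next to the explicit edges.

```python
import numpy as np

hours = np.arange(24)  # assumed data range: 0 to 23

# With an integer bin count, the edges are spread evenly between min and max.
auto_edges = np.histogram_bin_edges(hours, bins=8)
print(auto_edges)

# Explicit edges every 3 hours, matching the desired ticks.
explicit_edges = np.arange(0, 25, 3)
print(explicit_edges)
```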
Note that density=True means that the total area of the histogram is 1. As each bin is 3 hours wide, the bin heights sum to 1/3 (about 0.33), not 1.00 as you might expect. To really get a y-axis with relative frequencies, you could make the internal bin widths 1 by dividing the hours by 3. Afterwards you can relabel the x-axis back to hours.
So the following changes could be made for all the bins to sum to 100 %:
from matplotlib.ticker import PercentFormatter
plt.hist(comments19['comment_hour'] / 3, bins=np.arange(9), alpha=1, align='mid', edgecolor='white', label='2019',
density=True)
plt.xticks(np.arange(9), np.arange(0, 25, 3))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
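The arithmetic behind this rescaling can be checked with np.histogram, which plt.hist uses internally. The sketch assumes randomly generated hours (the seed and sample size are arbitrary) and compares the sum of bin heights for 3-hour-wide bins against unit-width bins.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=100)  # made-up comment hours, 0 to 23

# Bins 3 hours wide: the area is 1, so the heights sum to 1/3.
density, _ = np.histogram(hours, bins=np.arange(0, 25, 3), density=True)
print(density.sum())

# Dividing by 3 gives bins of width 1: heights now equal relative frequencies.
rel_freq, _ = np.histogram(hours / 3, bins=np.arange(9), density=True)
print(rel_freq.sum())
```

The first sum comes out at about 0.333 and the second at 1, up to floating-point rounding.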
