Hide non observed categories in a seaborn boxplot - python

I am currently working on a data analysis, and want to show some data distributions through seaborn boxplots.
I have a categorical column, 'seg1', which can take three values in my dataset ('Z1', 'Z3', 'Z4'). However, the data in group 'Z4' is too exotic to report, and I would like to produce boxplots showing only categories 'Z1' and 'Z3'.
Filtering the data source of the plot did not work: category 'Z4' is still shown, just with no data points.
Is there any other solution than having to create a new CategoricalDtype with only ('Z1', 'Z3') and cast/project my data back on this new category?
I would simply like to hide 'Z4' category.
I am using seaborn 0.10.1 and matplotlib 3.3.1.
Thanks in advance for your answers.
My attempts are below, along with some data to reproduce the problem.
Dummy data
dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2]})
df.col1 = df.col1.astype(dummy_cat)
sns.boxplot(data=df, x='col1', y='col2')
Apply no filter
fig, axs = plt.subplots(figsize=(8, 25), nrows=len(indicators2), squeeze=False)
for j, indicator in enumerate(indicators2):
    sns.boxplot(data=orders, y=indicator, x='seg1', hue='origin2', ax=axs[j, 0], showfliers=False)
Which produces:
Filter data source
mask_filter = orders.seg1.isin(['Z1', 'Z3'])
fig, axs = plt.subplots(figsize=(8, 25), nrows=len(indicators2), squeeze=False)
for j, indicator in enumerate(indicators2):
    sns.boxplot(data=orders.loc[mask_filter], y=indicator, x='seg1', hue='origin2', ax=axs[j, 0], showfliers=False)
Which produces:

To cut off the last (or first) x-value, set_xlim() can be used, e.g. ax.set_xlim(-0.5, 1.5).
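A minimal sketch of the set_xlim() approach, reusing the dummy data from the question. It assumes the unwanted category is the last one on the axis, so that the three categories sit at x positions 0, 1 and 2. The matplotlib.use("Agg") line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2.]})
df['col1'] = df['col1'].astype(dummy_cat)

fig, ax = plt.subplots()
sns.boxplot(data=df, x='col1', y='col2', ax=ax)
# Categories are drawn at x = 0, 1, 2; end the axis halfway between 'b' and 'c'
ax.set_xlim(-0.5, 1.5)
```

This only crops the view, so it does not help when the category to hide sits between two categories you want to keep.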
Another option is to work with seaborn's order= parameter and only add the desired values in that list. Optionally that can be created programmatically:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2]})
df.col1 = df.col1.astype(dummy_cat)
# keep only categories that actually occur (exact match; str.contains would also match substrings)
order = [cat for cat in dummy_cat.categories if (df['col1'] == cat).any()]
sns.boxplot(data=df, x='col1', y='col2', order=order)
plt.show()
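If the goal is simply "only the categories that actually occur", the order list can probably also be derived with pandas' Series.cat.remove_unused_categories(). The column's own dtype stays untouched, because the result is only used to build the list. A sketch on the same dummy data:

```python
import pandas as pd

dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2.]})
df['col1'] = df['col1'].astype(dummy_cat)

# Observed categories, in the order defined by the dtype
order = list(df['col1'].cat.remove_unused_categories().cat.categories)
print(order)

# The list can then be passed on, e.g.
# sns.boxplot(data=df, x='col1', y='col2', order=order)
```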

Related

Side-by-side boxplots from two pandas in one figure

I have two pandas dataframes containing data for three different categories: 'a', 'b' and 'c'.
import pandas as pd
import numpy as np
n=100
df_a = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
                     'val': np.random.normal(0, 1, 3*n)})
df_b = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
                     'val': np.random.normal(1, 1, 3*n)})
I would like to illustrate the differences in 'a', 'b' and 'c' between the two dataframes, and for that I want to use boxplots. I.e., for each category ('a', 'b' and 'c'), I want to make side-by-side boxplots - and they should all be in the same figure.
So one figure containing 6 boxplots, 2 per category. How can I achieve this the easiest?
IIUC:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 2)
for j, df in enumerate([df_a, df_b]):
    for i, cat in enumerate(sorted(df['id'].unique())):
        df[df['id'] == cat].boxplot('val', 'id', ax=axes[i, j])
plt.tight_layout()
plt.show()
Does this help? I tried to make it somewhat dynamic and flexible:
import matplotlib.pyplot as plt
import pandas
import seaborn as sns
ids = [val for val in df_a["id"].unique() for _ in (0, 1)]
fig, ax = plt.subplots(len(ids)//2,2, figsize=(10,10))
plt.subplots_adjust(hspace=0.5, wspace=0.3)
plt.suptitle("df_a vs. df_b")
ax = ax.ravel()
for i, id in enumerate(ids):
    if i%2 == 0:
        ax[i] = sns.boxplot(x=df_a[df_a.id == id]["val"], ax = ax[i])
    else:
        ax[i] = sns.boxplot(x=df_b[df_b.id == id]["val"], ax = ax[i])
    ax[i].set_title(id)
sns.despine()
You could add an extra column to indicate the dataset and then concatenate the dataframes:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
n = 100
df_a = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
                     'val': np.random.normal(0, 1, 3 * n)})
df_b = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
                     'val': np.random.normal(1, 1, 3 * n)})
df_a['dataset'] = 'set a'
df_b['dataset'] = 'set b'
sns.boxplot(data=pd.concat([df_a, df_b]), x='id', y='val', hue='dataset', palette='spring')
plt.tight_layout()
plt.show()
PS: Note that in matplotlib (and seaborn, which builds upon it), a figure is a plot with one or more subplots (referred to as ax). As you write figure instead of plot, it might give the impression that you want multiple subplots. You can use sns.catplot(...., kind='box') to create multiple subplots from the concatenated dataframe.
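In case separate subplots are what you are after, here is a rough sketch of the sns.catplot(..., kind='box') variant on the concatenated dataframe. The layout (one column of subplots per 'id', one box per dataset) is just one possible choice, and np.repeat is used here as a shorter way to build the 'id' column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

n = 100
rng = np.random.default_rng(0)
ids = np.repeat(['a', 'b', 'c'], n)
df_a = pd.DataFrame({'id': ids, 'val': rng.normal(0, 1, 3 * n), 'dataset': 'set a'})
df_b = pd.DataFrame({'id': ids, 'val': rng.normal(1, 1, 3 * n), 'dataset': 'set b'})
combined = pd.concat([df_a, df_b], ignore_index=True)

# One subplot (column) per category, one box per dataset inside each subplot
g = sns.catplot(data=combined, x='dataset', y='val', col='id', kind='box')
```

catplot returns a FacetGrid, so the individual subplots are available through g.axes if you need to tweak them.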

Non overlapping error bars in line plot

I am using Pandas and Matplotlib to create some plots. I want line plots with error bars on them. The code I am using currently looks like this
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
df_yerr = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
fig, ax = plt.subplots()
df.plot(yerr=df_yerr, ax=ax, fmt="o-", capsize=5)
ax.set_xscale("log")
plt.show()
With this code, I get 6 lines on a single plot (which is what I want). However, the error bars completely overlap, making the plot difficult to read.
Is there a way I could slightly shift the position of each point on the x-axis so that the error bars no longer overlap?
Here is a screenshot:
One way to achieve what you want is to plot the error bars 'by hand', but it is neither straightforward nor much better looking than your original. Basically, you let pandas produce the line plot, then iterate through the data frame columns and draw a pyplot errorbar plot for each of them with the index shifted slightly sideways (with a logarithmic x axis, as in your case, this means multiplying by a factor). In the error bar plots, the marker size is set to zero:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
colors = ['red','blue','green','yellow','purple','black']
df = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
df_yerr = pd.DataFrame(index=[10,100,1000,10000], columns=['A', 'B', 'C', 'D', 'E', 'F'], data=np.random.rand(4,6))
fig, ax = plt.subplots()
df.plot(ax=ax, marker="o",color=colors)
index = df.index
rows = len(index)
columns = len(df.columns)
factor = 0.95
for column,color in zip(range(columns),colors):
    y = df.values[:,column]
    yerr = df_yerr.values[:,column]
    ax.errorbar(
        df.index*factor, y, yerr=yerr, markersize=0, capsize=5,color=color,
        zorder = 10,
    )
    factor *= 1.02
ax.set_xscale("log")
plt.show()
As I said, the result is not pretty:
UPDATE
In my opinion a bar plot would be much more informative:
fig2,ax2 = plt.subplots()
df.plot(kind='bar',yerr=df_yerr, ax=ax2)
plt.show()
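Coming back to the shifting idea: a variation that may look a bit tidier is to shift the whole series (line, markers and error bars together) and to spread the factors symmetrically around 1 in log space. This is only a sketch. The max_shift value is an assumption to tune by eye, and the error values are scaled down here just to keep the example readable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = [10, 100, 1000, 10000]
cols = ['A', 'B', 'C', 'D', 'E', 'F']
df = pd.DataFrame(rng.random((4, 6)), index=index, columns=cols)
df_yerr = pd.DataFrame(0.1 * rng.random((4, 6)), index=index, columns=cols)

# Multiplicative offsets, evenly spaced in log space and centred on 1
max_shift = 0.05  # assumed half-width of the spread, in decades
factors = 10 ** np.linspace(-max_shift, max_shift, len(cols))

fig, ax = plt.subplots()
x = df.index.to_numpy()
for factor, col in zip(factors, cols):
    # Line, markers and error bars all use the same shifted x positions
    ax.errorbar(x * factor, df[col].to_numpy(), yerr=df_yerr[col].to_numpy(),
                fmt='o-', capsize=5, label=col)
ax.set_xscale('log')
ax.legend()
```

Because the offsets are factors rather than additive shifts, the spacing between series looks the same at every decade on the log axis.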
You can also reduce the clutter by making the lines semi-transparent with alpha, for example:
df.plot(yerr=df_yerr, ax=ax, fmt="o-", capsize=5,alpha=0.5)

Remove anti-aliasing for pandas plot.area

I want to plot stacked areas with Python, and found this pandas function:
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area();
However, the result is weirdly antialiased, mixing together the colors, as shown on those 2 plots:
The same problem occurs in the example provided in the documentation.
Do you know how to remove this anti-aliasing? (Or another way to get a neat output for a stacked representation of line plots.)
Using a matplotlib stack plot works fine
fig, ax = plt.subplots()
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
ax.stackplot(df.index, df.values.T)
Since the area plot is a stackplot, the only difference would be the linewidth of the areas, which you can set to zero.
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area(linewidth=0)
The remaining grayish lines are then indeed due to antialiasing. You may turn that off in the matplotlib plot
fig, ax = plt.subplots()
ax.stackplot(df.index, df.values.T, antialiased=False)
The result however, may not be visually appealing:
It looks like there are two boundaries.
Try a zero line width:
df.plot.area(lw=0);

Plot duplication in Pandas Plot()

There is an issue with the plot() function in Pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
ax = df.plot()
ax.legend(ncol=1, bbox_to_anchor=(1., 1, 0., 0), loc=2 , prop={'size':6})
This will make a plot with too many lines. Note, however, that half of them will be on top of each other. It seems to have something to do with the ax object, because when I do not use it the issue goes away.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
df.plot()
UPDATE
While not ideal for my use case, the issue can be fixed by using a MultiIndex:
columns = pd.MultiIndex.from_arrays([np.hstack([ ['left']*2, ['right']*2]), ['A', 'B']*2], names=['High', 'Low'])
df = pd.DataFrame(np.random.randn(8, 4), columns=columns)
ax = df.plot()
ax.legend(ncol=1, bbox_to_anchor=(1., 1, 0., 0), loc=2 , prop={'size':16})
It has to do with your duplication of column names, not ax at all (if you call plt.legend after your second example you see the same extra lines). Having multiple columns with the same name is confusing the call to DataFrame.plot_frame.
If you change your columns to ['A', 'B', 'C', 'D'] instead, it's fine.
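If the duplicated names come from upstream data and you cannot simply retype them, one option is to make the labels unique before plotting. A small sketch: dedupe_columns is a hypothetical helper, not a pandas function, and it does not guard against a suffixed name such as 'A_1' already being present.

```python
import numpy as np
import pandas as pd

def dedupe_columns(columns):
    """Append _1, _2, ... to repeated labels (hypothetical helper)."""
    seen = {}
    result = []
    for name in columns:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
print(df.columns.duplicated())  # flags the repeated labels
df.columns = dedupe_columns(df.columns)
print(list(df.columns))
```

After this, df.plot() sees four distinct column names.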

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal',
           alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue',
            alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1 as suggested in "pandas multiple plots not working as hists", and the example in "Overlaying multiple histograms using pandas" does not do what I need either. When I use the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order you call .hist() matters (the first one will be at the back)
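When overlaying like this, the two histograms get their own bin edges by default, which can make them hard to compare. A possible refinement is to compute common edges first and add some transparency. This sketch uses plain matplotlib, and the number of bins (10) is an arbitrary choice:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(37, 2)), columns=['A', 'B'])

# Common bin edges computed from both columns together
edges = np.histogram_bin_edges(df[['A', 'B']].to_numpy().ravel(), bins=10)

fig, ax = plt.subplots()
counts_a, bins_a, _ = ax.hist(df['A'], bins=edges, alpha=0.5, label='A')
counts_b, bins_b, _ = ax.hist(df['B'], bins=edges, alpha=0.5, label='B')
ax.legend()
```

With alpha below 1 the histogram at the back stays visible through the one in front, so the call order matters less.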
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
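For a quick look without the plot, a tiny example (three rows instead of 200, so it fits on screen) shows the reshaping: melt() stacks the columns into a long table with a 'variable' column holding the old column name and a 'value' column holding the number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(3, 2)), columns=['A', 'B'])
melted = df.melt()

print(df)      # wide: one column per series
print(melted)  # long: a 'variable' column (A or B) and a 'value' column
```

That long format is what lets seaborn use hue='variable' to split the bars.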
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
Create two dataframes and plot both on a single matplotlib axis:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I explicitly specified the bins and range, since I did not remove outliers as the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
See the Matplotlib example on plotting histograms of multiple data sets with different sizes.
This can be done more concisely (using the first and others dataframes from the question):
plt.hist([first.prglngth, others.prglngth], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
You could also look at the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
Readability is limited when the histograms overlap, but it may be enough for your case:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
