How to create a seaborn graph that shows probability per bin? - python

I would like to make a graph using seaborn. I have three types that are called 1, 2 and 3. In each type, there are groups P and F. I would like to present the graph in a way that each bin sums up to 100% and shows how many of each type are of group P and group F. I would also like to show the types as categorical rather than interpreted as numbers.
Could someone give me suggestions how to adapt the graph?
So far, I have used the following code:
sns.displot(data=df, x="TYPE", hue="GROUP", multiple="stack", discrete=1, stat="probability")
And this is the graph:

The option multiple='fill' stretches every bar so that its segments sum to 1 (that is, 100%). You can use ax.bar_label(), available since matplotlib 3.4, to label each segment.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame({'TYPE': np.random.randint(1, 4, 30),
                   'GROUP': np.random.choice(['P', 'F'], 30, p=[0.8, 0.2])})
g = sns.displot(data=df, x='TYPE', hue='GROUP', multiple='fill', discrete=True, stat='probability')
ax = g.axes.flat[0]
# show only the type values that occur as x ticks
ax.set_xticks(np.unique(df['TYPE']))
# label every bar segment with its proportion
for bars in ax.containers:
    ax.bar_label(bars, label_type='center', fmt='%.2f')
plt.show()
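If you also want the per-type shares as numbers, or want the types handled as categories from the start, a cross-tabulation is one option. The sketch below is an alternative to displot rather than what the answer above does; it assumes the same df layout (columns TYPE and GROUP), and the name shares is only illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(12345)
df = pd.DataFrame({'TYPE': np.random.randint(1, 4, 30),
                   'GROUP': np.random.choice(['P', 'F'], 30, p=[0.8, 0.2])})

# Convert TYPE to strings so it is handled as a category, not a number.
# normalize='index' divides each row by its total, so every type sums to 1.
shares = pd.crosstab(df['TYPE'].astype(str), df['GROUP'], normalize='index')

# One stacked bar per type; each bar has a total height of 1 (100%).
ax = shares.plot(kind='bar', stacked=True, rot=0)
ax.set_ylabel('Proportion')
for bars in ax.containers:
    ax.bar_label(bars, label_type='center', fmt='%.2f')
# call plt.show() to display the figure
```

Printing shares gives the same proportions as a table, which can be handy for checking the labels.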

Related

Show median and quantiles on Seaborn pairplot (Python)

I am making a corner plot using Seaborn. I would like to display lines on each diagonal histogram showing the median value and quantiles. Example shown below.
I usually do this using the Python package 'corner', which is straightforward. I want to use Seaborn just because it has better aesthetics.
The seaborn plot was made using this code:
import seaborn as sns
df = pd.DataFrame(samples_new, columns = ['r1', 'r2', 'r3'])
cornerplot = sns.pairplot(df, corner=True, kind='kde',diag_kind="hist", diag_kws={'color':'darkslateblue', 'alpha':1, 'bins':10}, plot_kws={'color':'darkslateblue', 's':10, 'alpha':0.8, 'fill':False})
Seaborn provides test data sets that come in handy when explaining something you want to change in the default behavior. That way, you don't need to generate your own test data, nor supply your own data, which can be complicated and/or sensitive.
To update the subplots in the diagonal, there is g.map_diag(...) which will call a given function for each individual column. It gets 3 parameters: the data used for the x-axis, a label and a color.
Here is an example to add vertical lines for the main quantiles, and change the title. You can add more calculations for further customizations.
import matplotlib.pyplot as plt
import seaborn as sns
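def update_diag_func(data, label, color):
    # draw a dotted vertical line at each quartile of this column
    for val in data.quantile([.25, .5, .75]):
        plt.axvline(val, ls=':', color=color)
    # use the column name as the title of the diagonal subplot
    plt.title(data.name, color=color)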
iris = sns.load_dataset('iris')
g = sns.pairplot(iris, corner=True, diag_kws={'kde': True})
g.map_diag(update_diag_func)
g.fig.subplots_adjust(top=0.97) # provide some space for the titles
plt.show()
Seaborn is built on top of matplotlib, so you can try this:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame(samples_new, columns = ['r1', 'r2', 'r3'])
cornerplot = sns.pairplot(df, corner=True, kind='kde',diag_kind="hist", diag_kws={'color':'darkslateblue', 'alpha':1, 'bins':10}, plot_kws={'color':'darkslateblue', 's':10, 'alpha':0.8, 'fill':False})
plt.text(300, 250, "An annotation")
plt.show()

Can you highlight specific observations in a categorical scatter plot in seaborn?

I have 8 categories and I already plotted categorical scatter plot with sns.catplot. Is there a way to highlight a specific observation(s) in each category to compare the positions with respect to other observations?
You could use text annotations, using the annotate method on the ax (matplotlib.axes.Axes) attribute of the FacetGrid object returned by seaborn.catplot. For example, the code below annotates the observations that are greater than .5 in a normal sample:
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.DataFrame(data={'x': range(10), 'y':np.random.normal(0,1,size=10)})
df['odd'] = df.x.apply(lambda x: x % 2)
g = sns.catplot(data=df, x='x', y='y', hue='odd')
df[df.y > .5].apply(lambda p: g.ax.annotate(f'({p.x}, {round(p.y, 2)})', (p.x, p.y)), axis=1)
See the matplotlib documentation for Axes.annotate for more details.
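Another option, sketched here as an alternative to text annotations, is to flag the observations of interest in a new column and color by that flag. The data, the 0.5 threshold and the column name highlight are all made up for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({'category': np.repeat(list('ABCD'), 10),
                   'y': rng.normal(0, 1, size=40)})

# Hypothetical rule: highlight every observation above 0.5.
df['highlight'] = np.where(df['y'] > 0.5, 'highlighted', 'other')

# Coloring by the flag makes the chosen points stand out within each category.
g = sns.catplot(data=df, x='category', y='y', hue='highlight',
                hue_order=['other', 'highlighted'],
                palette={'other': 'lightgray', 'highlighted': 'crimson'})
```

Any boolean condition can replace the threshold, for example membership in a list of row labels.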

Time Frequency Color Map

I'd like to use a color map to show how often each point occurs (for example, (1,2) occurs 3 times), while still keeping my x-axis (i.e. df['A']).
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'A': [1,1,1,1,2,2,3,4,6,7,7],
                   'B': [2,2,2,3,3,4,5,6,7,8,8]})
plt.figure()
plt.scatter(df['A'], df['B'])
plt.show()
Here is my current plot
I'd like to keep the same axis I have, while adding the colormap. Hope I was being clear.
You can calculate the frequency of each value using Counter from the collections module. Note that the snippet below counts the values of column B only, not (A, B) pairs.
freq_dic = collections.Counter(df["B"])
You then need to add these counts to your dataframe as a new column and pass two extra options, c and cmap, to the scatter plot. The colorbar is displayed with plt.colorbar. This code is far from perfect, so any further improvements are very welcome.
import pandas as pd
import matplotlib.pyplot as plt
import collections
df = pd.DataFrame({'A': [1,1,1,1,2,2,3,4,6,7,7],
                   'B': [2,2,2,3,3,4,5,6,7,8,8]})
# count how often each value of B occurs
freq_dic = collections.Counter(df["B"])
# store the count of each row's B value in a new 'freq' column
for index, entry in enumerate(df["B"]):
    df.at[index, 'freq'] = (freq_dic[entry])
plt.figure()
plt.scatter(df['A'], df['B'],
            c=df['freq'],
            cmap='viridis')
plt.colorbar()
plt.show()
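Since the question asks for the frequency of a point such as (1,2), counting (A, B) pairs may be closer to what is wanted than counting B alone. A minimal sketch of that variant, using a pandas groupby instead of collections.Counter:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': [1,1,1,1,2,2,3,4,6,7,7],
                   'B': [2,2,2,3,3,4,5,6,7,8,8]})

# Count how often each (A, B) pair occurs and copy that count onto every row.
df['freq'] = df.groupby(['A', 'B'])['A'].transform('size')

fig, ax = plt.subplots()
points = ax.scatter(df['A'], df['B'], c=df['freq'], cmap='viridis')
fig.colorbar(points, ax=ax, label='frequency')
# call plt.show() to display the figure
```

With this data the point (1,2) gets a frequency of 3 and (7,8) a frequency of 2, while (1,3) and (2,3) each count once because they are different points.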

Different point size based on hue argument in seaborn

I am trying to have different point sizes on a seaborn scatterplot depending on the value in the "hue" column of my dataframe.
sns.scatterplot(x="X", y="Y", data=df, hue='value',style='value')
value can take 3 different values (0, 1 and 2), and I would like the points whose value is 2 to be bigger on the graph.
I tried the sizes argument:
sizes=(1,1,4)
But could not get it done this way.
Let's use the s parameter and pass a list of sizes using a function of df['value'] to scale the point sizes:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'X':[1,2,3],'Y':[1,4,9],'value':[1,0,2]})
_ = sns.scatterplot(x='X',y='Y', data=df, s=df['value']*50+10)
Output:
Using seaborn's own scatterplot arguments:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'X':[1,2,3,4,5],'Y':[1,2,3,4,5],
                   'value':[1,1,0,2,2]})
df["size"] = np.where(df["value"] == 2, "Big", "Small")
sns.scatterplot(x="X", y="Y", hue='value', size="size",
                data=df, size_order=("Small", "Big"), sizes=(160, 40))
plt.show()
Note that the order of sizes needs to be reversed compared to size_order. I have no idea why that would make sense, though.
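One way to sidestep the ordering question is to pass sizes as a dict that maps each level to a marker size. The seaborn documentation describes a tuple passed to sizes as a (min, max) range rather than one size per level, which may be why the pair looks reversed above. The sketch below assumes the same dataframe; size_map is an illustrative name.

```python
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [1, 2, 3, 4, 5],
                   'value': [1, 1, 0, 2, 2]})
df['size'] = np.where(df['value'] == 2, 'Big', 'Small')

# Map each size level to an explicit marker area, so the order no longer matters.
size_map = {'Small': 40, 'Big': 160}
ax = sns.scatterplot(x='X', y='Y', hue='value', size='size',
                     sizes=size_map, data=df)
```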

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column='prglngth', bins=40, color='teal',
           alpha=0.5)
others.hist(column='prglngth', bins=40, color='blue',
            alpha=0.5)
plt.show()
The above code does not work when I use ax = ax1, as suggested in "pandas multiple plots not working as hists", and the example in "Overlaying multiple histograms using pandas" does not do what I need either. When I run the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order in which you call .hist() matters (the first one will be at the back).
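When overlaying this way, each call picks its own bins, so the bars of the two histograms may not line up. A possible refinement, sketched here under the assumption that shared bins are wanted, is to compute one set of edges with np.histogram_bin_edges and pass it to both calls; the alpha value is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(37, 2)), columns=['A', 'B'])

# Compute one set of bin edges from both columns together.
edges = np.histogram_bin_edges(df[['A', 'B']].to_numpy().ravel(), bins=10)

fig, ax = plt.subplots()
# alpha < 1 keeps the histogram drawn first visible behind the second one.
df['A'].hist(ax=ax, bins=edges, alpha=0.5, label='A')
df['B'].hist(ax=ax, bins=edges, alpha=0.5, label='B')
ax.legend()
# call plt.show() to display the figure
```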
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
             multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
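A tiny frame with made-up values shows the reshaping: melt() turns the "wide" layout (one column per series) into a "long" layout with a variable column holding the former column names and a value column holding the numbers.

```python
import pandas as pd

# Tiny illustrative frame (made-up values) in wide form: one column per series.
df = pd.DataFrame({'A': [0.1, 0.2], 'B': [1.1, 1.2]})

# melt() stacks the columns into long form: one row per (column name, value).
long_df = df.melt()
print(long_df)
#   variable  value
# 0        A    0.1
# 1        A    0.2
# 2        B    1.1
# 3        B    1.2
```

The long form is what lets histplot use hue='variable' to split the data by its original column.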
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
Create two dataframes and a single matplotlib axes, then pass that axes to both hist() calls:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'data1': np.random.randn(10),
    'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I have explicitly specified the bins and range, as I didn't handle outlier removal the way the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib multihist example that plots samples of different sizes.
This can be done more briefly:
plt.hist([first.prglngth, others.prglngth], bins=40, color=('teal', 'blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become a visual burden.
You could also look at the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
Visibility is limited, though, so check whether it helps!
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
