Separate out (and keep) duplicate categorical data using Seaborn barplot? - python

I'm trying to plot some hypothetical student testing scores. I'd like to have student last name on the y-axis and test score on the x-axis (horizontal barplot). Because student names are non-unique, I'd like to allow duplicates on the y-axis. I've seen ways to get rid of duplicate data in seaborn and/or pandas, but not how to keep it. Here's the code I have:
import seaborn as sns
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
scores = pd.read_csv('input_file.csv', sep=',').sort_values("score", ascending=True)
sns.set_color_codes("pastel")
sns.barplot(x="score", y="lastName", data=scores, color="b", ci=None)
plt.title('Scores')
sns.despine(left=True, bottom=True)
plt.savefig('path_to_file.pdf')
I thought that maybe I should be using factorplot and setting the orientation to "h" and type to "bar" but that produced a "tight layout" warning and, indeed, a tight/badly-rendered plot.
FYI, currently I have a barplot that looks nice enough, but it groups non-unique lastnames and sums their test scores; that's what I'm looking to fix.

You can plot a bar for each unique row (by using the index as your y-coordinate), and then manually assign y-axis tick labels.
df = pd.DataFrame({
    'name': ['A', 'B', 'A', 'B'],
    'score': [10, 20, 30, 40],
})
ax = sns.barplot(x=df.score, y=df.index, orient='h')
ax.set_yticklabels(df.name)
Note that for this task, Seaborn might actually be overkill; you aren't doing any statistical visualization. Since you don't need to group non-unique values and display confidence intervals, matplotlib.pyplot.barh is sufficient (just import seaborn for good-looking plots).
plt.barh(df.index, df.score, align='center')
plt.yticks(df.index, df.name)
plt.gca().invert_yaxis()
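As a self-contained illustration of the index-as-position idea above, here is a minimal sketch using plain matplotlib. The column names `lastName` and `score` follow the question; the data values are made up for illustration, and the `Agg` backend is only selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores; the last name "Smith" appears twice on purpose.
scores = pd.DataFrame({
    "lastName": ["Smith", "Jones", "Smith", "Lee"],
    "score": [72, 88, 95, 64],
}).sort_values("score").reset_index(drop=True)

fig, ax = plt.subplots()
# One bar per row: the positional index is unique even when names repeat,
# so duplicate names are not grouped together.
bars = ax.barh(list(scores.index), scores["score"], align="center")
# Relabel the numeric positions with the (possibly duplicated) names.
ax.set_yticks(list(scores.index))
ax.set_yticklabels(scores["lastName"])
ax.set_xlabel("score")
```

Resetting the index after sorting matters here: it makes the bar positions follow the sorted order rather than the original row order. Call `plt.show()` or `fig.savefig(...)` to view the result.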

Related

Show median and quantiles on Seaborn pairplot (Python)

I am making a corner plot using Seaborn. I would like to display lines on each diagonal histogram showing the median value and quantiles. Example shown below.
I usually do this using the Python package 'corner', which is straightforward. I want to use Seaborn just because it has better aesthetics.
The seaborn plot was made using this code:
import pandas as pd
import seaborn as sns
df = pd.DataFrame(samples_new, columns = ['r1', 'r2', 'r3'])
cornerplot = sns.pairplot(df, corner=True, kind='kde',diag_kind="hist", diag_kws={'color':'darkslateblue', 'alpha':1, 'bins':10}, plot_kws={'color':'darkslateblue', 's':10, 'alpha':0.8, 'fill':False})
Seaborn provides test data sets that come in handy to explain something you want to change to the default behavior. That way, you don't need to generate your own test data, nor to supply your own data that can be complicated and/or sensitive.
To update the subplots in the diagonal, there is g.map_diag(...) which will call a given function for each individual column. It gets 3 parameters: the data used for the x-axis, a label and a color.
Here is an example to add vertical lines for the main quantiles, and change the title. You can add more calculations for further customizations.
import matplotlib.pyplot as plt
import seaborn as sns
def update_diag_func(data, label, color):
    for val in data.quantile([.25, .5, .75]):
        plt.axvline(val, ls=':', color=color)
    plt.title(data.name, color=color)
iris = sns.load_dataset('iris')
g = sns.pairplot(iris, corner=True, diag_kws={'kde': True})
g.map_diag(update_diag_func)
g.fig.subplots_adjust(top=0.97) # provide some space for the titles
plt.show()
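The quantile-marking logic does not depend on pairplot itself. The following minimal sketch applies the same idea to a single matplotlib histogram, which can make it easier to experiment with. The helper name `mark_quantiles` and the synthetic `samples` series are assumptions for illustration, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for one column of the real samples.
rng = np.random.default_rng(0)
samples = pd.Series(rng.normal(loc=5.0, scale=2.0, size=500), name="r1")

def mark_quantiles(ax, data, probs=(0.25, 0.5, 0.75), color="darkslateblue"):
    """Draw a dotted vertical line at each requested quantile and return the values."""
    values = data.quantile(list(probs))
    for val in values:
        ax.axvline(val, ls=":", color=color)
    ax.set_title(data.name, color=color)
    return values

fig, ax = plt.subplots()
ax.hist(samples, bins=10, color="lightsteelblue")
quantile_values = mark_quantiles(ax, samples)
```

Passing the axes explicitly, instead of relying on the current axes as `plt.axvline` does, is a design choice that keeps the helper usable outside a grid callback.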
Seaborn is built on top of matplotlib, so you can try this:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame(samples_new, columns = ['r1', 'r2', 'r3'])
cornerplot = sns.pairplot(df, corner=True, kind='kde',diag_kind="hist", diag_kws={'color':'darkslateblue', 'alpha':1, 'bins':10}, plot_kws={'color':'darkslateblue', 's':10, 'alpha':0.8, 'fill':False})
plt.text(300, 250, "An annotation")
plt.show()

How to create a seaborn graph that shows probability per bin?

I would like to make a graph using seaborn. I have three types that are called 1, 2 and 3. In each type, there are groups P and F. I would like to present the graph in a way that each bin sums up to 100% and shows how many of each type are of group P and group F. I would also like to show the types as categorical rather than interpreted as numbers.
Could someone give me suggestions how to adapt the graph?
So far, I have used the following code:
sns.displot(data=df, x="TYPE", hue="GROUP", multiple="stack", discrete=1, stat="probability")
And this is the graph:
The option multiple='fill' stretches all bars to sum up to 1 (for 100%). You can use the new ax.bar_label() to label each bar.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame({'TYPE': np.random.randint(1, 4, 30),
                   'GROUP': np.random.choice(['P', 'F'], 30, p=[0.8, 0.2])})
g = sns.displot(data=df, x='TYPE', hue='GROUP', multiple='fill', discrete=True, stat='probability')
ax = g.axes.flat[0]
ax.set_xticks(np.unique(df['TYPE']))
for bars in ax.containers:
    ax.bar_label(bars, label_type='center', fmt='%.2f')
plt.show()
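To check the numbers that the filled bars should display, the per-type shares can be computed directly with pandas. This is a minimal sketch with made-up data; the column names `TYPE` and `GROUP` follow the question, and converting `TYPE` to strings is one assumed way to treat the types as labels rather than numbers.

```python
import pandas as pd

# Hypothetical data: three types, each split between groups P and F.
df = pd.DataFrame({
    "TYPE": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "GROUP": ["P", "P", "P", "F", "P", "F", "P", "P", "F", "F"],
})
# Treat the types as categorical labels rather than numbers.
df["TYPE"] = df["TYPE"].astype(str)

# normalize="index" divides each row by its total, so every type sums to 1.
proportions = pd.crosstab(df["TYPE"], df["GROUP"], normalize="index")
print(proportions)
```

Each row of `proportions` corresponds to one bar, and each column to one stacked segment within it.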

Multi Index Seaborn Line Plot

I have a multi index dataframe, with the two indices being Sample and Lithology
Sample        20EC-P     20EC-8  20EC-10-1  ...      20EC-43    20EC-45    20EC-54
Lithology         Pd     Di-Grd         Gb  ...  Hbl Plag Pd     Di-Grd         Gb
Rb          7.401575  39.055118   6.456693  ...     0.629921  56.535433  11.653543
Ba         24.610102  43.067678  10.716841  ...     1.073115  58.520532  56.946630
Th          3.176471  19.647059   3.647059  ...     0.823529  29.647059   5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data=data, hue=data.columns.get_level_values("Lithology"),
                      style=data.columns.get_level_values("Sample"),
                      dashes=False, palette="deep")
The lineplot comes out as shown in the linked image (not reproduced here).
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.
To use hue= and style=, seaborn prefers its dataframes in long form. pd.melt() will combine all columns and create new columns with the old column names, and a column for the values. The index also needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the column categorical applying a fixed order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
column_tuples = [('20EC-P', 'Pd'), ('20EC-8', 'Di-Grd'), ('20EC-10-1', 'Gb'),
                 ('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
                  hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
  index  Sample Lithology      value
0    Rb  20EC-P        Pd   6.135005
1    Ba  20EC-P        Pd   6.924961
2    Th  20EC-P        Pd  44.270570
...
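The reshaping and ordering steps can be checked without any plotting. The sketch below uses a reduced two-column version of the frame with rounded values, and names the index `element`, which is an assumption for readability (the answer above relies on the default name `index`). It shows that the categorical keeps the original row order rather than sorting alphabetically.

```python
import pandas as pd

# Reduced, hypothetical version of the wide frame with MultiIndex columns.
col_index = pd.MultiIndex.from_tuples(
    [("20EC-P", "Pd"), ("20EC-8", "Di-Grd")], names=["Sample", "Lithology"]
)
data = pd.DataFrame(
    [[7.4, 39.1], [24.6, 43.1], [3.2, 19.6]],
    columns=col_index,
    index=pd.Index(["Rb", "Ba", "Th"], name="element"),
)

# Wide to long: one row per (element, Sample, Lithology) combination.
data_long = data.melt(ignore_index=False).reset_index()

# Fix the category order to the row order of the original frame.
data_long["element"] = pd.Categorical(data_long["element"], categories=data.index)

# Sorting a categorical column follows the category order, not the alphabet.
ordered = data_long.sort_values("element")["element"].astype(str).unique().tolist()
print(ordered)
```

Because the order lives in the column's categories, any plotting function that respects categorical order picks it up without an extra `order=` argument.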

Setting different color for each class in a scatter plot which is made by using pd.pivot_table

I am new to Pandas and its libraries. By using the following code I can make a scatter plot of my 'class' in the plane 'Month' vs 'Amount'. Because I consider more than one class I would like to use colors for distinguishing each class and to see a legend in the figure.
My first attempt below generates dots with a different color for each class, but it does not produce the right legend. The second attempt produces a legend, but the labels are wrong: I only see the first letter of each class name. Moreover, the second attempt plots as many figures as there are classes. I would like to know how to correct both attempts. Any ideas or suggestions? Thanks in advance.
ps. I wanted to use
colors = itertools.cycle(['gold','blue','red','chocolate','mediumpurple','dodgerblue'])
as well, so that I could decide the colors. I could not make it though.
Attempts:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import matplotlib.cm as cm
np.random.seed(176)
random.seed(16)
df = pd.DataFrame({'class': random.sample(['living room','dining room','kitchen','car','bathroom','office']*10, k=25),
                   'Amount': np.random.sample(25)*100,
                   'Year': random.sample(list(range(2010,2018))*50, k=25),
                   'Month': random.sample(list(range(1,12))*100, k=25)})
print(df.head(25))
print(df['class'].unique())
for cls1 in df['class'].unique():
    test1= pd.pivot_table(df[df['class']==cls1], index=['class', 'Month', 'Year'], values=['Amount'])
    print(test1)
colors = cm.rainbow(np.linspace(0,2,len(df['class'].unique())))
fig, ax = plt.subplots(figsize=(8,6))
for cls1,c in zip(df['class'].unique(),colors):
    # SCATTER PLOT
    test = pd.pivot_table(df[df['class']==cls1], index=['class', 'Month', 'Year'], values=['Amount'], aggfunc=np.sum).reset_index()
    test.plot(kind='scatter', x='Month',y='Amount', figsize=(16,6),stacked=False,ax=ax,color=c,s=50).legend(df['class'].unique(),scatterpoints=1,loc='upper left',ncol=3,fontsize=10.5)
plt.show()
for cls2,c in zip(df['class'].unique(),colors):
    # SCATTER PLOT
    test = pd.pivot_table(df[df['class']==cls2], index=['class', 'Month', 'Year'], values=['Amount'], aggfunc=np.sum).reset_index()
    test.plot(kind='scatter', x='Month',y='Amount', figsize=(16,6),stacked=False,color=c,s=50).legend(cls2,scatterpoints=1,loc='upper left',ncol=3,fontsize=10.5)
plt.show()
Up-to-date code
I would like to plot the following code via scatter plot.
for cls1 in df['class'].unique():
    test3= pd.pivot_table(df[df['class']==cls1], index=['class', 'Month'], values=['Amount'], aggfunc=np.sum)
    print(test3)
Unlike above, each class now appears only once per month, thanks to the sum over Amount.
Here my attempt:
for cls2 in df['class'].unique():
    test2= pd.pivot_table(df[df['class']==cls2], index=['class','Year'], values=['Amount'], aggfunc=np.sum).reset_index()
    print(test2)
    sns.lmplot(x='Year' , y='Amount', data=test2, hue='class',palette='hls', fit_reg=False,size= 5, aspect=5/3, legend_out=False,scatter_kws={"s": 70})
    plt.show()
This gives me one plot for each class. Apart from the first one (class=car), which shows different colors, the others seem to be OK. Still, I would like to have only one plot with all classes.
With Marvin Taschenberger's helpful suggestion, here is the up-to-date result:
I get a white dot instead of a colored one, and the legend sits in a different place than in your figure. Moreover, the year labels are not displayed correctly. Why?
An easy way to work around your problem (unfortunately without fixing your original code) is to let seaborn do the heavy lifting with a single line:
# Note: seaborn 0.9 renamed the `size` parameter to `height`.
sns.lmplot(x='Month', y='Amount', data=df, hue='class', palette='hls', fit_reg=False, height=8, aspect=5/3, legend_out=False)
You could also plug in other colors for palette
EDIT: how about this then:
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import seaborn as sns
np.random.seed(176)
random.seed(16)
df = pd.DataFrame({'class': random.sample(['living room','dining room','kitchen','car','bathroom','office']*10, k=25),
                   'Amount': np.random.sample(25)*100,
                   'Year': random.sample(list(range(2010,2018))*50, k=25),
                   'Month': random.sample(list(range(1,12))*100, k=25)})
# Aggregate once over all classes so that a single plot shows every class.
frame = pd.pivot_table(df, index=['class','Year'], values=['Amount'], aggfunc=np.sum).reset_index()
# Note: seaborn 0.9 renamed the `size` parameter to `height`.
sns.lmplot(x='Year', y='Amount', data=frame, hue='class', palette='hls', fit_reg=False, height=5, aspect=5/3, legend_out=False, scatter_kws={"s": 70})
plt.show()
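If you would rather stay with plain matplotlib and pick the colors yourself (as with the `itertools.cycle` idea in the question), one possible approach is to draw one scatter per class on a shared axes and let the `label` argument build the legend. This is a minimal sketch with made-up data, not a fix of the original attempts.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the same column names as in the question.
df = pd.DataFrame({
    "class": ["kitchen", "car", "office", "kitchen", "car", "office"],
    "Month": [1, 1, 2, 3, 3, 4],
    "Amount": [10.0, 25.0, 40.0, 15.0, 30.0, 5.0],
})
colors = itertools.cycle(["gold", "blue", "red", "chocolate", "mediumpurple", "dodgerblue"])

fig, ax = plt.subplots(figsize=(8, 6))
# groupby iterates over classes in sorted order; each class gets the next color.
for (cls, group), color in zip(df.groupby("class"), colors):
    monthly = group.groupby("Month", as_index=False)["Amount"].sum()
    # label=cls gives one legend entry per class with its full name.
    ax.scatter(monthly["Month"], monthly["Amount"], s=50, color=color, label=cls)
ax.set_xlabel("Month")
ax.set_ylabel("Amount")
legend = ax.legend(loc="upper left", ncol=3, scatterpoints=1)
```

Passing a bare string such as `cls2` to `.legend(...)`, as in the second attempt, makes matplotlib treat it as a sequence of one-character labels, which would explain seeing only the first letter of each class name.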

Plot each column of Pandas dataframe pairwise against one column

I have a pandas dataframe where one of the columns is a set of labels that I would like to plot each of the other columns against in subplots. In other words, I want the y-axis of each subplot to use the same column, called 'labels', and I want a subplot for each of the remaining columns with the data from each column on the x-axis. I expected the following code snippet to achieve this, but I don't understand why this results in a single nonsensical plot:
examples.plot(subplots=True, layout=(-1, 3), figsize=(20, 20), y='labels', sharey=False)
The problem with that code is that you didn't specify an x value. It seems nonsensical because it's plotting the labels column against an index from 0 to the number of rows. As far as I know, you can't do what you want in pandas directly. You might want to check out seaborn though, it's another visualization library that has some nice grid plotting helpers.
Here's an example with your data:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
examples = pd.DataFrame(np.random.rand(10,4), columns=['a', 'b', 'c', 'labels'])
g = sns.PairGrid(examples, x_vars=['a', 'b', 'c'], y_vars='labels')
g = g.map(plt.plot)
This creates the following plot:
Obviously it doesn't look great with random data, but hopefully with your data it will look better.
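If seaborn is not an option, a similar grid can be built with a short matplotlib loop. The sketch below is one possible layout, assuming the column holding the labels is literally named `labels`; the data is random, as in the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
examples = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "labels"])

# Every column except 'labels' gets its own subplot on the x-axis.
x_columns = [col for col in examples.columns if col != "labels"]
fig, axes = plt.subplots(1, len(x_columns), figsize=(12, 4), sharey=True)
for ax, col in zip(axes, x_columns):
    ax.scatter(examples[col], examples["labels"])
    ax.set_xlabel(col)
# The y-axis is shared, so labelling the first subplot is enough.
axes[0].set_ylabel("labels")
```

With many columns, a multi-row layout such as `plt.subplots(nrows, 3)` with `axes.flat` in the loop would be a natural extension.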
