I have data (duration of a certain activity) for two categories (Monday, Tuesday). I would like to generate a bar chart (see 1). Bars above a threshold (different for both categories) should have a different color; e.g. on Mondays data above 10 hours should be blue and on Tuesdays above 12 hours. Any ideas how I could implement this in seaborn or matplotlib?
Thank you very much.
Monday = [5,6,8,12,5,20,4, 8]
Tuesday=[3,5,8,12,4,17]
Goal
You could draw two barplots, using an array of booleans for the coloring (hue):
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
monday = np.array([5, 6, 8, 12, 5, 20, 4, 8])
tuesday = np.array([3, 5, 8, 12, 4, 17])
sns.set_style('whitegrid')
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
palette = {False: 'skyblue', True: 'tomato'}
sns.barplot(x=np.arange(len(monday)), y=monday, hue=monday >= 10, palette=palette, dodge=False, ax=ax0)
ax0.set_xlabel('Monday', size=20)
ax0.set_xticks([])
ax0.legend_.remove()
sns.barplot(x=np.arange(len(tuesday)), y=tuesday, hue=tuesday >= 12, palette=palette, dodge=False, ax=ax1)
ax1.set_xlabel('Tuesday', size=20)
ax1.set_xticks([])
ax1.legend_.remove()
sns.despine()
plt.tight_layout()
plt.subplots_adjust(wspace=0)
plt.show()
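The same threshold coloring can also be done without seaborn by passing one color per bar to `ax.bar`. This is only a minimal sketch and not part of the answer above: the helper name `threshold_colors` is made up for illustration, and it uses a strict `>` comparison to match "above 10 hours" in the question, whereas the seaborn code uses `>=`.

```python
import numpy as np
import matplotlib.pyplot as plt

monday = np.array([5, 6, 8, 12, 5, 20, 4, 8])
tuesday = np.array([3, 5, 8, 12, 4, 17])

def threshold_colors(values, threshold, below='skyblue', above='tomato'):
    """Hypothetical helper: return one color name per value."""
    return np.where(np.asarray(values) > threshold, above, below).tolist()

monday_colors = threshold_colors(monday, 10)    # threshold for Monday
tuesday_colors = threshold_colors(tuesday, 12)  # threshold for Tuesday

fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
ax0.bar(np.arange(len(monday)), monday, color=monday_colors)
ax0.set_xlabel('Monday')
ax1.bar(np.arange(len(tuesday)), tuesday, color=tuesday_colors)
ax1.set_xlabel('Tuesday')
# call plt.show() to display the figure
```

Because the colors are plain lists, each day can use its own threshold without any legend handling.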
I can plot a histogram in Python for example with matplotlib:
from matplotlib import pyplot as plt
x = [3,5,12,7,8,6,4,6]
plt.hist(x)
However I have a second array y = [4,6,8,2,4,5,8,7] where each value corresponds to the value at the same position of x. Now I would like to create a histogram where each bar's height is defined by x, but each bar's color is defined by the values in y that belong to its x values. You could also say I have tuples as in list(zip(x,y)) where the first value should be used for the histogram itself and the mean value of the second tuple value in each bin should determine the color.
np.unique(x, return_counts=True) returns an array with the unique values of x and their count.
Converting everything to numpy arrays, y[x == val] selects the subset of y at each position where x is equal to val. y[x == val].mean() gets the mean of those values. Calling cmap(norm(...)) gives the color corresponding to that value. The cmap and norm can be used to create a colorbar.
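To make the masking step concrete, here is a small sketch that only computes the numbers, with no plotting. It uses the same `x` and `y` as the question; the variable name `means` is chosen here for illustration.

```python
import numpy as np

x = np.array([3, 5, 12, 7, 8, 6, 4, 6])
y = np.array([4, 6, 8, 2, 4, 5, 8, 7])

# unique x values (sorted) and how often each occurs
values, counts = np.unique(x, return_counts=True)

# the boolean mask x == val picks the y entries that share this x value
means = np.array([y[x == val].mean() for val in values])

for val, count, mean in zip(values, counts, means):
    print(f"x={val}: count={count}, mean y={mean:.1f}")
```

Only `x == 6` occurs twice here, so that is the only bar whose color comes from an average of more than one `y` value.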
Here is some example code, including embellishments to change ticks, margins and spines:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib.cm import ScalarMappable
import numpy as np
x = np.array([3, 5, 12, 7, 8, 6, 4, 6])
y = np.array([4, 6, 8, 2, 4, 5, 8, 7])
values, counts = np.unique(x, return_counts=True)
cmap = plt.get_cmap('inferno')
norm = plt.Normalize(0, y.max()) # or plt.Normalize(y.min(), y.max())
colors = [cmap(norm(y[x == val].mean())) for val in values]
fig, ax = plt.subplots()
ax.bar(values, counts, color=colors, edgecolor='black')
ax.yaxis.set_major_locator(MultipleLocator(1))
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.set_ylabel('Count')
ax.margins(x=0.02, y=0)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.colorbar(ScalarMappable(cmap=cmap, norm=norm), pad=0.02, ax=ax)
plt.show()
Here is another example, using the tips dataset from seaborn, with the rounded total_bill on the x-axis, the count on the y-axis and colored via the tip amount.
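import seaborn as sns
tips = sns.load_dataset('tips')
x = np.round(tips['total_bill'])
y = np.array(tips['tip'])
values, counts = np.unique(x, return_counts=True)
cmap = plt.get_cmap('turbo')
# The snippet stops here in the original; the lines below are an assumed continuation mirroring the first example.
norm = plt.Normalize(y.min(), y.max())
colors = [cmap(norm(y[x == val].mean())) for val in values]
fig, ax = plt.subplots()
ax.bar(values, counts, color=colors)
plt.colorbar(ScalarMappable(cmap=cmap, norm=norm), ax=ax)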
PS: As mentioned in @Arne's answer, seaborn's hue can replace the norm and the color assignment. Without embellishments, the code would look like:
import numpy as np
import seaborn as sns
x = np.array([3, 5, 12, 7, 8, 6, 4, 6])
y = np.array([4, 6, 8, 2, 4, 5, 8, 7])
values, counts = np.unique(x, return_counts=True)
sns.set_style('darkgrid')
ax = sns.barplot(x=values, y=counts, hue=[y[x == val].mean() for val in values],
palette='inferno', dodge=False)
The seaborn library is very useful to visualize multi-dimensional data like these. You could store x and y in a pandas dataframe and then add the bin numbers and the average y values per bin:
import numpy as np
import pandas as pd
import seaborn as sns
x = [3, 5, 12, 7, 8, 6, 4, 6]
y = [4, 6, 8, 2, 4, 5, 8, 7]
n_bins = 4 # number of bins for the histogram
df = pd.DataFrame({'x': x, 'y': y})
_, bin_edges = np.histogram(x, bins=n_bins)
df['bin'] = pd.cut(x, bins=bin_edges, labels=False, include_lowest=True)
color = df.groupby('bin').mean()['y']
df['color'] = df.bin.apply(lambda k: color[k])
df
x y bin color
0 3 4 0 6.000000
1 5 6 0 6.000000
2 12 8 3 8.000000
3 7 2 1 4.666667
4 8 4 2 4.000000
5 6 5 1 4.666667
6 4 8 0 6.000000
7 6 7 1 4.666667
Then drawing the colored histogram is easy:
sns.histplot(data=df, x='x', bins=bin_edges, hue='color');
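The per-bin averages can also be computed with NumPy alone, which is a handy cross-check on the `color` column in the table above. This is a sketch under the same inputs (`x`, `y`, `n_bins = 4`); the names `y_sums` and `mean_y` are introduced here for illustration.

```python
import numpy as np

x = np.array([3, 5, 12, 7, 8, 6, 4, 6])
y = np.array([4, 6, 8, 2, 4, 5, 8, 7])
n_bins = 4

# number of x values per bin, plus the bin edges
counts, bin_edges = np.histogram(x, bins=n_bins)

# weights=y sums the y values that fall in each bin instead of counting them
y_sums, _ = np.histogram(x, bins=bin_edges, weights=y)

# mean y per bin; empty bins (count 0) are left as NaN instead of dividing by zero
mean_y = np.divide(y_sums, counts, out=np.full(n_bins, np.nan), where=counts > 0)
print(mean_y)
```

The resulting means agree with the `color` values shown in the dataframe (6.0, 4.67, 4.0 and 8.0 for bins 0 to 3).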
I am using the following code to draw 5 bars for 3 different data sets a, b and c. How can I show all colors in each bar? I don't want their values to add up. For example, if in the first bar the value of Green is 1, Yellow is 3 and Red is 6, the final height should be 6 rather than 10, but every color should be visible up to its own value. I don't want to use transparent colors or only bar outlines.
import matplotlib.pyplot as plt
import numpy as np
a = [1, 2, 3, 4, 5]
b = [3, 4, 1, 10, 9]
c = [6, 7, 2, 4, 6]
ind = np.arange(len(a))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(x=ind, height=a, width=0.35, align='center', label='Green',
facecolor='g')
ax.bar(x=ind, height=b, width=0.35, align='center', label='Yellow',
facecolor='y')
ax.bar(x=ind, height=c, width=0.35, align='center', label='Red', facecolor='r')
plt.xticks(ind, a)
plt.xlabel('Coordination Number')
plt.ylabel('Frequency')
plt.legend()
plt.show()
The reference value for the 'a' column is 6, but it was unclear if it is the maximum value. I understood it to be the maximum value and calculated the composition ratio.
I created a stacked graph based on the results.
import numpy as np
import pandas as pd
a = [1, 2, 3, 4, 5]
b = [3, 4, 1, 10, 9]
c = [6, 7, 2, 4, 6]
ind = np.arange(len(a))
df = pd.DataFrame({'a':a,'b':b,'c':c}, index=ind)
df['total'] = df.sum(axis=1)
df['max'] = df[['a','b','c']].max(axis=1)
df['aa'] = df['max']*(df['a']/df['total'])
df['bb'] = df['max']*(df['b']/df['total'])
df['cc'] = df['max']*(df['c']/df['total'])
df
a b c total max aa bb cc
0 1 3 6 10 6 0.600000 1.800000 3.600000
1 2 4 7 13 7 1.076923 2.153846 3.769231
2 3 1 2 6 3 1.500000 0.500000 1.000000
3 4 10 4 18 10 2.222222 5.555556 2.222222
4 5 9 6 20 9 2.250000 4.050000 2.700000
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(x=ind, height=df.loc[:,'aa'], bottom=0, width=0.35, align='center', label='Green',
facecolor='g')
ax.bar(x=ind, height=df.loc[:,'bb'], bottom=df.loc[:,'aa'], width=0.35, align='center', label='Yellow',
facecolor='y')
ax.bar(x=ind, height=df.loc[:,'cc'], bottom=df.loc[:,'aa']+df.loc[:,'bb'], width=0.35, align='center', label='Red', facecolor='r')
plt.xticks(ind, a)
plt.xlabel('Coordination Number')
plt.ylabel('Frequency')
plt.legend()
plt.show()
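A different reading of the question is that every bar should start at zero and keep its true height, with shorter bars simply drawn in front of taller ones. The following is a sketch of that interpretation, not the approach of the answer above: at each x position the series are sorted from tallest to shortest and drawn in that order, so the shortest ends up on top.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

heights = np.array([[1, 2, 3, 4, 5],      # Green
                    [3, 4, 1, 10, 9],     # Yellow
                    [6, 7, 2, 4, 6]])     # Red
labels = ['Green', 'Yellow', 'Red']
colors = ['g', 'y', 'r']
ind = np.arange(heights.shape[1])

# for each x position: series indices ordered from tallest to shortest
order = np.argsort(-heights, axis=0, kind='stable')

fig, ax = plt.subplots()
for rank in range(heights.shape[0]):
    series = order[rank]  # which series has this rank at each x position
    # bars added later are drawn on top, so the shorter bars stay visible
    ax.bar(ind, heights[series, ind], width=0.35,
           color=[colors[s] for s in series])

ax.set_xticks(ind)
ax.set_xticklabels([str(v) for v in heights[0]])
ax.set_xlabel('Coordination Number')
ax.set_ylabel('Frequency')
# the bars are colored individually, so the legend is built by hand
ax.legend(handles=[Patch(facecolor=c, label=l) for c, l in zip(colors, labels)])
# call plt.show() to display the figure
```

One caveat: when two series have the same value at a position (Green and Red are both 4 in the fourth bar), one of them is completely hidden behind the other.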
If I understand your question correctly, you want to show all colour bars starting from the same zero baseline and grouped together under their corresponding Number?
I'll use bokeh for plotting, since it provides an easy way to "offset" each bar in the group. To vary the amount of visual offset for each bar, change the second parameter of the dodge function. For this combination of widths, 0.05 seemed like a nice value.
from bokeh.io import output_notebook, output_file, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import dodge
output_notebook() # or output_file("chart.html") if not using Jupyter
x_axis_values = [str(x) for x in range(1, 6)]
data = {
"Coordination Number" : x_axis_values,
"Green" : [1, 2, 3, 4, 5],
"Yellow" : [3, 4, 1, 10, 9],
"Red" : [6, 7, 2, 4, 6]
}
src = ColumnDataSource(data=data)
p = figure(
x_range=x_axis_values, y_range=(0, 10), plot_height=275,
title="Offset Group Bar Chart", toolbar_location=None, tools="")
p.vbar(
x=dodge('Coordination Number', -0.05, range=p.x_range),
top='Green', width=0.2, source=src, color="#8DD3C7", legend_label="Green")
p.vbar(
x=dodge('Coordination Number', 0.0, range=p.x_range),
top='Yellow', width=0.2, source=src, color="#FFD92F", legend_label="Yellow")
p.vbar(
x=dodge('Coordination Number', 0.05, range=p.x_range),
top='Red', width=0.2, source=src, color="#E15759", legend_label="Red")
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
p.xaxis.axis_label = "Coordination Number"
p.yaxis.axis_label = "Frequency"
show(p)
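For readers who prefer to stay in matplotlib, the same small-offset idea can be sketched by shifting the x positions by hand. The offsets, widths and colors below simply mirror the bokeh example; this is an illustrative sketch, not an exact visual match.

```python
import numpy as np
import matplotlib.pyplot as plt

data = {
    'Green': [1, 2, 3, 4, 5],
    'Yellow': [3, 4, 1, 10, 9],
    'Red': [6, 7, 2, 4, 6],
}
colors = {'Green': '#8DD3C7', 'Yellow': '#FFD92F', 'Red': '#E15759'}
offsets = {'Green': -0.05, 'Yellow': 0.0, 'Red': 0.05}  # same values as the dodge calls
ind = np.arange(1, 6)  # coordination numbers 1..5

fig, ax = plt.subplots()
for name, heights in data.items():
    # shift each series slightly so the bars overlap but stay distinguishable
    ax.bar(ind + offsets[name], heights, width=0.2,
           color=colors[name], label=name)

ax.set_xticks(ind)
ax.set_xlabel('Coordination Number')
ax.set_ylabel('Frequency')
ax.legend(loc='upper left', ncol=3)
# call plt.show() to display the figure
```

Increasing the offsets towards the bar width (0.2) turns this into an ordinary grouped bar chart with no overlap.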
I am trying to plot groups of data that have different lengths. Do you have any idea how I can visualize a female list containing only two values without padding it with zeros to match the length of the male list?
This is the code I have so far:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
female = [2, 7]
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, male, width, label='male')
rects2 = ax.bar(x + width/2, female, width, label='female')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
fig.tight_layout()
plt.show()
You can make two different arrays for the x-positions:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
female = [2, 7]
x_male = np.arange(len(male))
x_female = np.arange(len(female))
offset_male = np.zeros(len(male))
offset_female = np.zeros(len(female))
shorter = min(len(x_male), len(x_female))
width = 0.35 # the width of the bars
offset_male[:shorter] = width/2
offset_female[:shorter] = width/2
fig, ax = plt.subplots()
rects1 = ax.bar(x_male - offset_male, male, width, label='male')
rects2 = ax.bar(x_female + offset_female, female, width, label='female')
That said, this solution only works when values are missing at the end of the shorter list. For values missing within the list, it would be better to use None or np.nan, as suggested by @desert_ranger.
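For the case of gaps in the middle of the list, one possible sketch is to key the shorter data set by group label and fill every missing label with NaN. The dictionary `female_by_label` and its contents are an assumption made for illustration; the question itself only has a plain list.

```python
import math
import numpy as np
import matplotlib.pyplot as plt

labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
# assumed input format: female values keyed by label, with G2 and G4 missing
female_by_label = {'G1': 2, 'G3': 7}

# align to the label order; labels without data become NaN
female = [female_by_label.get(label, math.nan) for label in labels]

x = np.arange(len(labels))  # the label locations
width = 0.35                # the width of the bars

fig, ax = plt.subplots()
ax.bar(x - width/2, male, width, label='male')
ax.bar(x + width/2, female, width, label='female')  # NaN heights leave a gap
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
# call plt.show() to display the figure
```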
If you don't want to fill them up with zeros, you could assign NaN values to them:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
female = [2, 7,np.nan,np.nan]
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
ax.bar(x - width/2, male, width, label='male')
ax.bar(x + width/2, female, width, label='female')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
fig.tight_layout()
plt.show()
Use pandas and plot with pandas.DataFrame.plot.bar
Tested in python 3.8.11, pandas 1.3.1, and matplotlib 3.4.2
Option 1
Column data must be the same length when creating a dataframe, therefore use itertools.zip_longest to zip lists of unequal length, which fills the missing data with a fillvalue.
import pandas as pd
import matplotlib.pyplot as plt
from itertools import zip_longest
# data
labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
female = [2, 7]
# zip lists together
data = zip_longest(male, female)
# create dataframe from data
df = pd.DataFrame(data, columns=['male', 'female'], index=labels)
male female
G1 1 2.0
G2 3 7.0
G3 10 NaN
G4 20 NaN
# plot
p = df.plot.bar(rot=0)
plt.show()
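To see what `zip_longest` actually hands to the dataframe constructor, here is a small standard-library sketch. By default the missing entries are filled with `None`; the `fillvalue` argument can supply something else, such as NaN.

```python
from itertools import zip_longest
import math

male = [1, 3, 10, 20]
female = [2, 7]

# default fillvalue is None
rows = list(zip_longest(male, female))
print(rows)  # [(1, 2), (3, 7), (10, None), (20, None)]

# an explicit fillvalue, here NaN
rows_nan = list(zip_longest(male, female, fillvalue=math.nan))
print(rows_nan[2][0], math.isnan(rows_nan[2][1]))  # 10 True
```

pandas shows the `None` entries as NaN in the `female` column, which is why that column is displayed as floats in the output above.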
Option 2
Rows do not need to be the same length when they are added to the dataframe; the missing values are filled with NaN.
import pandas as pd
import matplotlib.pyplot as plt
# data
labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
female = [2, 7]
# create a dataframe from the lists
df = pd.DataFrame([male, female], columns=labels, index=['male', 'female'])
G1 G2 G3 G4
male 1 3 10.0 20.0
female 2 7 NaN NaN
# plot
p = df.plot.bar(rot=0)
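Note that `DataFrame.plot.bar` puts the index on the x-axis, so with the Option 2 layout the bars are grouped by male/female rather than by G1 to G4. If the grouping of Option 1 is wanted, transposing the dataframe first is one way to get it. A minimal sketch:

```python
import pandas as pd

labels = ['G1', 'G2', 'G3', 'G4']
male = [1, 3, 10, 20]
female = [2, 7]

# rows of unequal length; the missing cells become NaN
df = pd.DataFrame([male, female], columns=labels, index=['male', 'female'])

# transpose so that G1..G4 become the index (x-axis) and male/female the bar series
df_t = df.T
ax = df_t.plot.bar(rot=0)
# call matplotlib.pyplot.show() to display the figure
```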
I would like to plot two columns of a pandas dataframe as side-by-side box plots by category. This is not the same as the question "Grouped boxplot with seaborn", where the two columns contain lists. The solution there did not work for me.
MWE
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
[
[2, 4, "A"],
[4, 5, "C"],
[5, 4, "B"],
[10, 4.2, "A"],
[9, 3, "B"],
[3, 3, "C"]
], columns=['data1', 'data2', 'Categories'])
#Plotting by seaborn
fig, axs = plt.subplots(1, 1)
sns.boxplot(data=df,x="Categories",y='data1',ax=axs)
fig.show()
plt.waitforbuttonpress()
plt.close(fig)
The code above generates:
Replacing "data1" with "data2" in the boxplot line would give:
What I want is something like this:
You need to melt (convert to long format) the DataFrame first:
data = df.melt(id_vars=['Categories'], var_name='dataset', value_name='values')
print(data)
Prints:
Categories dataset values
0 A data1 2.0
1 A data2 4.0
2 C data1 4.0
3 C data2 5.0
4 B data1 5.0
5 B data2 4.0
6 A data1 10.0
7 A data2 4.2
8 B data1 9.0
9 B data2 3.0
10 C data1 3.0
11 C data2 3.0
Now you just have to use dataset as the hue. Since the plot is quite busy I moved the legend outside it.
sns.boxplot(data=data, x='Categories', y='values', hue='dataset')
plt.legend(title='dataset', loc='upper left', bbox_to_anchor=(1, 1))
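As a numeric cross-check of what the grouped boxes display, the medians per category and dataset can be computed from the same long-format frame. This is a sketch using the dataframe from the question; the variable name `medians` is introduced here.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [2, 4, "A"],
        [4, 5, "C"],
        [5, 4, "B"],
        [10, 4.2, "A"],
        [9, 3, "B"],
        [3, 3, "C"]
    ], columns=['data1', 'data2', 'Categories'])

# long format: one row per (category, dataset, value)
data = df.melt(id_vars=['Categories'], var_name='dataset', value_name='values')

# median per category and dataset, i.e. the center line of each box
medians = data.groupby(['Categories', 'dataset'])['values'].median().unstack()
print(medians)
```

Each of the 6 original rows contributes two rows to the long frame, one for `data1` and one for `data2`, so `data` has 12 rows.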
Edit by OP:
I implemented this in a function such that it makes the plot with as many columns as desired in an ax and returns it.
def box_plot_columns(df,categories_column,list_of_columns,legend_title,y_axis_title,**boxplotkwargs):
columns = [categories_column] + list_of_columns
newdf = df[columns].copy()
data = newdf.melt(id_vars=[categories_column], var_name=legend_title, value_name=y_axis_title)
return sns.boxplot(data=data, x=categories_column, y=y_axis_title, hue=legend_title, **boxplotkwargs)
Usage Example:
fig, ax = plt.subplots(1,1)
ax = box_plot_columns(df,"Categories",["data1","data2"],"dataset","values",ax=ax)
ax.set_title("My Plot")
plt.show()
Try this:
df = pd.DataFrame(
[
[2, 4, "A"],
[4, 5, "C"],
[5, 4, "B"],
[10, 4.2, "A"],
[9, 3, "B"],
[3, 3, "C"]
], columns=['data1', 'data2', 'Categories'])
#Plotting by seaborn
df_c = pd.melt(df, "Categories", var_name="data1", value_name="data2")
# factorplot was renamed to catplot in seaborn 0.9; x is passed as a keyword argument
sns.catplot(x="Categories", hue="data1", y="data2", data=df_c, kind="box")