How to plot a stacked area plot - python

I have a dataframe(df) with two columns: 'Foundation Type', which has 4 types of foundations (Shafts, Piles, Combination, Spread), and another column 'Vs30' with different values for parameter Vs30. Each row represents a bridge, with a type of foundation and a Vs30 value.
First, I create a new column 'binVs30' in df by binning each 'Vs30' value into one of 5 ranges ([0-200], [200-400], ..., [800-1000]).
df['binVs30'] = pd.cut(df.Vs30, bins=np.arange(0, 1100, 200))
Then I created a stacked area plot with the following code:
color_table = pd.crosstab(df['binVs30'], df['Foundation Type'], dropna=False)
ax = color_table.plot(kind='area', figsize=(8, 8), stacked=True, rot=0)
display(ax)
plt.xlabel('')
plt.ylabel('Frequency', fontsize=12)
plt.legend(title='Foundation Type', loc='upper right')
plt.title('Column Database', fontsize='20')
plt.show()
The resulting picture shows some extra bins that shouldn't be there. Therefore, I had to fix the xticks by manually adding the following code:
locs, labels = plt.xticks()
plt.xticks(locs, ['','0-200','','200-400','','400-600','','600-800','','800-1000'], fontsize=10, rotation=45)
Is there a reason why Python creates those extra bins that shouldn't exist? Is this a bug? If I switch to a stacked bar plot, the problem vanishes. Is there a way to fix it without manually setting the tick labels?
I also have two other questions. First, how can I add an edge color to an area plot? Something like:
color_table.plot(kind='area', figsize=(8, 8), stacked=True, edgecolor='black', legend=None, rot=0)
The argument edgecolor='black' doesn't work in a stacked area plot.
Second, is there a way to create bins for 'Vs30' like ([0-200], [200-400], ..., [>800])? The way I create the 'binVs30' column doesn't allow a bin that is '>800'.

There are a couple of questions here. First, about including an open-ended bin in pd.cut(): you can use np.inf as the last bin edge to capture everything in the final bin, and assign it a custom label. Second, since you're already using matplotlib, I'd recommend using its stackplot directly rather than going through pandas. Then you can use the edgecolor argument without any issues.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame(data={
    "foundation" : np.random.choice(list("ABCD"), 1000),
    "binVs30" : np.random.randint(0, 1200, 1000)
})
bins = [0, 200, 400, 600, 800, np.inf]
labels = ["0-199", "200-399", "400-599", "600-799", "800+"]
df["bins"] = pd.cut(
    df["binVs30"], bins=bins, labels=labels,
    right=False, include_lowest=True)
stack_data = pd.crosstab(df['bins'], df['foundation'], dropna=False)
stack_array = stack_data.values.T.tolist()
pal = sns.color_palette("Set1")
plt.figure(figsize=(8,4))
plt.stackplot(
    labels, stack_array, labels=list("ABCD"),
    colors=pal, alpha=0.4, edgecolor="black")
plt.legend(loc='upper left')
plt.show()
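To check how right=False and the np.inf edge treat values that sit exactly on a bin boundary, here is a minimal sketch. The Vs30 values are made up for illustration; the bins and labels are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical Vs30 values chosen to sit on or next to the bin edges.
vs30 = pd.Series([0, 199, 200, 799, 800, 1500])

bins = [0, 200, 400, 600, 800, np.inf]
labels = ["0-199", "200-399", "400-599", "600-799", "800+"]

# right=False closes every bin on the left: [0, 200), [200, 400), ..., [800, inf)
binned = pd.cut(vs30, bins=bins, labels=labels, right=False)

# Count per bin; sort=False keeps the counts from being reordered by frequency.
counts = binned.value_counts(sort=False)
print(binned.tolist())
print(counts)
```

With these edges, 200 goes into "200-399" rather than "0-199", and anything from 800 upwards ends up in "800+".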

Related

How to reduce the blank area in a grouped boxplot with many missing hue categories

I have an issue when plotting a categorical grouped boxplot by seaborn in Python, especially using 'hue'.
My raw data is shown in the figure below. I want to plot the values in column 8, grouped by columns 1 and 4.
I used seaborn and my code is shown below:
ax = sns.boxplot(x=output[:,1], y=output[:,8], hue=output[:,4])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.legend([],[])
However, the generated plot always contains a large blank area, as shown in the upper figure below. I tried adding 'dodge=False' to sns.boxplot, following a post here (https://stackoverflow.com/questions/53641287/off-center-x-axis-in-seaborn), but that gives the lower figure below.
What I actually want Python to plot is a boxplot like the one I generated with JMP below.
It seems that if one of the second-level categories is empty, seaborn still reserves space for it within each first-level category, which causes the observed offset/blank area.
So I wonder if there is any way to solve this issue, for example by using another Python package?
Seaborn reserves a spot for each individual hue value, even when some of these values are missing. When many hue values are missing, this leads to annoying open spots. (When there would be only one box per x-value, dodge=False would solve the problem.)
A workaround is to generate a separate subplot for each individual x-label.
Reproducible example for default boxplot with missing hue values
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(20230206)
df = pd.DataFrame({'label': np.repeat(['label1', 'label2', 'label3', 'label4'], 250),
                   'cat': np.repeat(np.random.choice([*'abcdefghijklmnopqrst'], 40), 25),
                   'value': np.random.randn(1000).cumsum()})
df['cat'] = pd.Categorical(df['cat'], [*'abcdefghijklmnopqrst'])
sns.set_style('white')
plt.figure(figsize=(15, 5))
ax = sns.boxplot(df, x='label', y='value', hue='cat', palette='turbo')
sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1, 1), ncol=2)
sns.despine()
plt.tight_layout()
plt.show()
Individual subplots per x value
A FacetGrid is generated with a subplot ("facet") for each x value.
The original hue is used as the x-value within each subplot. To avoid empty spots, the hue column should be of string type: with a pd.Categorical, seaborn would still reserve a spot for each of the categories.
df['cat'] = df['cat'].astype(str) # the column should be of string type, not pd.Categorical
g = sns.FacetGrid(df, col='label', sharex=False)
g.map_dataframe(sns.boxplot, x='cat', y='value')
for label, ax in g.axes_dict.items():
    ax.set_title('') # remove the title generated by sns.FacetGrid
    ax.set_xlabel(label) # use the label from the dataframe as xlabel
plt.tight_layout()
plt.show()
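The reason for the astype(str) conversion can be seen in pandas alone: a pd.Categorical column remembers every declared category, whether or not it occurs in the data. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical column: only 'a' and 'c' occur, but five categories are declared.
cat = pd.Series(pd.Categorical(["a", "c", "a"], categories=[*"abcde"]))

# All declared categories, used or not.
declared = list(cat.cat.categories)

# After converting to strings, only the values that actually occur remain.
present = sorted(cat.astype(str).unique())

print(declared)
print(present)
```

Here declared holds five entries while present holds two, which matches the difference between a spot per category and a spot per value described above.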
Adding consistent coloring
A dictionary palette can color the boxes such that corresponding boxes in different subplots have the same color. hue= with the same column as the x= will do the coloring, and dodge=False will remove the empty spots.
df['cat'] = df['cat'].astype(str) # the column should be of string type, not pd.Categorical
cats = np.sort(df['cat'].unique())
palette_dict = {cat: color for cat, color in zip(cats, sns.color_palette('turbo', len(cats)))}
g = sns.FacetGrid(df, col='label', sharex=False)
g.map_dataframe(sns.boxplot, x='cat', y='value',
                hue='cat', dodge=False, palette=palette_dict)
for label, ax in g.axes_dict.items():
    ax.set_title('') # remove the title generated by sns.FacetGrid
    ax.set_xlabel(label) # use the label from the dataframe as xlabel
    # ax.tick_params(axis='x', labelrotation=90) # optionally rotate the tick labels
plt.tight_layout()
plt.show()

Overlaying Pandas plot with Matplotlib is sensitive to the plotting order

I have the following problem: I'm trying to overlay two plots: one Pandas plot via plot.area() for a dataframe, and a second plot that is a standard Matplotlib plot. Depending on the order of the code for those two, the Matplotlib plot is displayed only if its code comes before the Pandas plot.area() call on the same axes.
Example: I have a Pandas dataframe called revenue that has a DateTimeIndex, and a single column with "revenue" values (float). Separately I have a dataset called projection with data along the same index (revenue.index)
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlaid properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain what I'm doing wrong, or what I need to do to make the code more robust? TIA.
The values on the x-axis are different in the two plots. I think DataFrame.plot.area() formats the DateTimeIndex in a pretty way, which is not compatible with pyplot.plot().
If you plot the projection first, plot.area() can still plot its data and does not reformat the x-axis.
Mixing the two seems tricky to me, so I would use either pyplot or DataFrame.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)
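Option 1 above uses fill_between, which covers a single column. If the DataFrame had several columns that should be stacked (as plot.area() does), matplotlib's stackplot is one way to stay entirely in matplotlib. The following is a sketch under that assumption; the column names product_a and product_b are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit when working interactively
import matplotlib.pyplot as plt
import pandas as pd

index = pd.date_range("2021-12-01", periods=4, freq="MS")
# Hypothetical two-column revenue table.
revenue = pd.DataFrame(
    {"product_a": [700, 1200, 1500, 2100], "product_b": [500, 1000, 1300, 2000]},
    index=index,
)
projection = [1000, 2000, 3000, 4000]

fig, ax = plt.subplots(figsize=(10, 4))
# One stacked band per column. Both calls go straight to matplotlib,
# so they use the same x-axis units whichever one is called first.
ax.stackplot(revenue.index, revenue.to_numpy().T, labels=list(revenue.columns), alpha=0.6)
ax.plot(revenue.index, projection, color="black", linewidth=3, label="projection")
ax.legend(loc="upper left")
fig.autofmt_xdate()  # slant the date labels so they do not overlap
```

Call plt.show() or fig.savefig(...) afterwards to view the result.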

set custom tick labels on heatmap color bar

I have a list of dataframes named merged_dfs that I am looping through to get the correlation and plot subplots of heatmap correlation matrix using seaborn.
I want to customize the colorbar tick labels, but I am having trouble figuring out how to do it with my example.
Currently, my colorbar scale values from top to bottom are
[1,0.5,0,-0.5,-1]
I want to keep these values, but change the tick labels to be
[1,0.5,0,0.5,1]
for my diverging color bar.
Here is the code and my attempt:
fig, ax = plt.subplots(nrows=6, ncols=2, figsize=(20,20))
for i, (title, merging) in enumerate(zip(new_name_data, merged_dfs)):
    graph = merging.corr()
    colormap = sns.diverging_palette(250, 250, as_cmap=True)
    a = sns.heatmap(graph.abs(), cmap=colormap, vmin=-1,vmax=1,center=0,annot = graph, ax=ax.flat[i])
    cbar = fig.colorbar(a)
    cbar.set_ticklabels(["1","0.5","0","0.5","1"])
fig.delaxes(ax[5,1])
plt.show()
plt.close()
I keep getting this error:
AttributeError: 'AxesSubplot' object has no attribute 'get_array'
Several things are going wrong:
fig.colorbar(...) would create a new colorbar, by default appended to the last subplot that was created.
sns.heatmap returns an ax (indicates a subplot). This is very different to matplotlib functions, e.g. plt.imshow(), which would return the graphical element that was plotted.
You can suppress the heatmap's colorbar (cbar=False), and then create it newly with the parameters you want.
fig.colorbar(...) needs a parameter ax=... when the figure contains more than one subplot.
Instead of creating a new colorbar, you can pass the colorbar parameters to sns.heatmap via cbar_kws=.... The colorbar itself can be found via ax.collections[0].colorbar. (ax.collections[0] is where matplotlib stores the graphical object that contains the heatmap.)
Looping by index is generally discouraged in Python. It's usually more readable, easier to maintain and less error-prone to put everything into the zip command.
Since your vmin is now -1, taking the absolute value for the coloring seems to be a mistake.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
merged_dfs = [pd.DataFrame(data=np.random.rand(5, 7), columns=[*'ABCDEFG']) for _ in range(5)]
new_name_data = [f'Dataset {i + 1}' for i in range(len(merged_dfs))]
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 7))
ticks = [-1, -0.5, 0, 0.5, 1]  # colorbar tick positions; the labels show their absolute values
for title, merging, ax in zip(new_name_data, merged_dfs, axes.flat):
    graph = merging.corr()
    colormap = sns.diverging_palette(250, 250, as_cmap=True)
    sns.heatmap(graph, cmap=colormap, vmin=-1, vmax=1, center=0, annot=True, ax=ax, cbar_kws={'ticks': ticks})
    ax.collections[0].colorbar.set_ticklabels([abs(t) for t in ticks])
fig.delaxes(axes.flat[-1])
fig.tight_layout()
plt.show()
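The points about fig.colorbar(...) can also be illustrated without seaborn. The sketch below uses plt.imshow-style plotting, where the call returns the graphical element (the "mappable") that fig.colorbar expects; the random data and the 'coolwarm' colormap are stand-ins chosen for illustration. With sns.heatmap, the corresponding object would be ax.collections[0], as noted above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit when working interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, size=(5, 5))  # hypothetical stand-in for a correlation matrix

ticks = [-1, -0.5, 0, 0.5, 1]
tick_labels = [f"{abs(t):g}" for t in ticks]  # absolute values as strings

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax in axes:
    # imshow returns the mappable (an AxesImage), not the Axes.
    im = ax.imshow(data, cmap="coolwarm", vmin=-1, vmax=1)
    # ax=ax attaches the colorbar to this subplot rather than to the last one created.
    cbar = fig.colorbar(im, ax=ax, ticks=ticks)
    cbar.set_ticklabels(tick_labels)
```

Passing the Axes itself instead of im is what leads to the 'get_array' AttributeError in the question.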

Seaborn heatmap widths do not match when using subplots

I am trying to adjust the width of my second subplot (column sum with the binary cmap) to the first one.
So far I have only managed to do so by trying different figsize values at random, but every time I try to re-use the code on a dataset of a different size I always end up with something like the picture below (the second heatmap is always wider than the first one).
Am I missing something to adjust the second one automatically ?
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
test = pd.DataFrame({'user': ['Bob', 'Bob', 'Bob','Janice','Janice','Fernand','Kevin','Sidhant'],
                     'tag' : ['enfant','enfant','enfant','femme','femme','jeune','jeune','jeune'],
                     'income': [3, 5, 1,14,8,10,13,17]})
# specify font sizes for later:
titlesize= 30
ticklabel = 23
legendlabel = 23
# Generate custom diverging colormaps:
cmap = sns.color_palette("ch:18,-.1,dark=.3", 6)
cmap2 = sns.color_palette("binary", 6)
# Preparing data for the heatmap:
heatmap1_data = pd.pivot_table(test, values='income',
                               index=['user'],
                               columns='tag')
heatmap1_data = heatmap1_data.reindex(heatmap1_data.sum().sort_values(ascending=False).index, axis=1)
# Creating figure:
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(10,15))
# First subplot:
sns.heatmap(heatmap1_data, ax= ax1, cmap=cmap,square=True, linewidths=.5, annot=True, cbar = False,annot_kws={"size": legendlabel} )
# Cosmetic first subplot:
ax1.xaxis.tick_top()
ax1.tick_params(labelsize= ticklabel, top = False)
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_xticklabels(heatmap1_data.columns,rotation=90)
ax1.set_yticklabels(heatmap1_data.index,rotation=0)
ax1.set_title("Activités par agence et population vulnérable", size= titlesize, pad=20)
# Second subplot (column sum at the bottom):
sns.heatmap((pd.DataFrame(heatmap1_data.sum(axis=0))).transpose().round(1), ax=ax2, square=True, fmt='g', linewidths=.5, annot=True, cmap=cmap2 , cbar=False, xticklabels=False, yticklabels=False, annot_kws={"size": legendlabel})
ax2.set_xlabel("Nombre d'activités", size = ticklabel, labelpad = 5)
# More cosmetic:
ax1.set_title("Title", size= titlesize, pad=35)
ax1.set_xlabel('')
ax1.set_ylabel('')
plt.tick_params(labelsize= ticklabel,left=False, bottom=False)
plt.xticks(rotation=60)
ax1.spines['bottom'].set_color('#dfe1ec')
ax1.spines['left'].set_color('#dfe1ec')
ax1.spines['top'].set_color('#dfe1ec')
ax1.spines['right'].set_color('#dfe1ec')
plt.tight_layout()
plt.show()
The issue is using square=True in sns.heatmap. Since the aspect ratios of the two subplots are wide vs tall, the way that the "squaring" is done is different for each. The first is made thinner, and the second is made shorter. It's done this way to fit into the constraints of your subplot Axes' sizes, which are defined to be equal by default when you call plt.subplots.
One way to get around this is to define the aspect ratios of your two Axes to be different and fit the shape of your data. This won't work 100 % of the time but will in most cases. You can use the keyword gridspec_kw and define a dictionary with 'height_ratios' in your call of plt.subplots.
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(10,15), gridspec_kw={'height_ratios':[5, 1]})
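Rather than hard-coding the 5 in 'height_ratios', the ratio could be derived from the shape of the data, so that the same code adapts to datasets of different sizes. This is a sketch of that idea, not a guaranteed fix; it reuses the test data from the question, and the top heatmap has one row per user while the bottom one has a single row.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit when working interactively
import matplotlib.pyplot as plt
import pandas as pd

test = pd.DataFrame({'user': ['Bob', 'Bob', 'Bob', 'Janice', 'Janice', 'Fernand', 'Kevin', 'Sidhant'],
                     'tag': ['enfant', 'enfant', 'enfant', 'femme', 'femme', 'jeune', 'jeune', 'jeune'],
                     'income': [3, 5, 1, 14, 8, 10, 13, 17]})
heatmap1_data = pd.pivot_table(test, values='income', index=['user'], columns='tag')

# Number of rows in the top heatmap; the column-sum heatmap below it has one row.
n_rows = heatmap1_data.shape[0]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 15),
                               gridspec_kw={'height_ratios': [n_rows, 1]})

# The two Axes heights now follow the n_rows : 1 ratio.
height_ratio = ax1.get_position().height / ax2.get_position().height
print(n_rows, round(height_ratio, 2))
```

The heatmaps themselves would then be drawn into ax1 and ax2 exactly as in the question.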

how to perform conditional area plotting with matplotlib?

I have created the following dataframe based on a range of data.
df['data_classification'] = df.myDatarange.apply(
    lambda a: 'Very good' if a >= -90
    else ('Good' if (a >= -100 and a <= -91)
    else ('Moderate' if (a >= -110 and a <= -101)
    else ('Poor' if (a >= -123 and a <= -111)
    else ('Bad' if (a >= -140 and a <= -124)
    else 'Off')))))
I am planning to plot myDatarange together with data_classification and show the relation between them using different colours, but I am unsure how to plot this.
I can plot myDatarange as a single lineplot, but how to relate the two data?
So far, I have tried the following:
x1 = df1.index
y1 = df1.myDatarange
f, (ax1,ax2) = plt.subplots(2,figsize=(5, 5))
ax1.plot(x1,y1,color='red', linewidth=1.9, alpha=0.9, label="myDataRange")
plt.show()
How can I plot the above range of data based on classification as area plot? Is there a better way than area plot to express my data? There are examples on the net, but not very clear on conditional side of it.
Seaborn's barplot can take a hue parameter to color each bar according to its 'data_classification'. The new 'data_classification' column can be created more quickly, and modified more easily, via pd.cut.
The barplot can be used as background for the lineplot to show the classification of each value.
Here is an example to get you started:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'myDatarange': np.random.randint(-150, -50, size=50)})
ranges = [-10**6, -140, -123, -110, -100, -90, 10**6]
df['data_classification'] = pd.cut(df['myDatarange'], ranges, right=False,
                                   labels=['Off', 'Bad', 'Poor', 'Moderate', 'Good', 'Very Good'])
fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(df.index, df['myDatarange'], color='blue', linewidth=2, alpha=0.9, label="myDataRange")
sns.barplot(x=df.index, y=[df['myDatarange'].min()] * len(df),
            hue='data_classification', alpha=0.5, palette='inferno', dodge=False, data=df, ax=ax1)
# optionally make the bars fill the complete background; by default seaborn sets the width to about 80%
for bar in ax1.patches:
    bar.set_width(1)
plt.legend(bbox_to_anchor=(1.02, 1.05) , loc='upper left')
plt.tight_layout()
plt.show()
PS: If you want the 0 at the bottom (it is now at the top due to the negative y-values), you could call ax1.invert_yaxis().
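A different way to show the classification, not the approach used above, is to draw one horizontal band per class with matplotlib's axhspan and plot the line on top. This is only a sketch with made-up readings; the bin edges and labels follow the pd.cut call above, with the outer edges written as infinities and clipped to the plotted range.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit when working interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical readings chosen to sit on the class boundaries.
values = pd.Series([-150, -140, -124, -123, -110, -100, -90, -50])
edges = [-np.inf, -140, -123, -110, -100, -90, np.inf]
labels = ['Off', 'Bad', 'Poor', 'Moderate', 'Good', 'Very Good']
classes = pd.cut(values, edges, right=False, labels=labels)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(values.index, values, color='blue', linewidth=2)

# One horizontal band per class; the infinite edges are clipped to the y-limits.
lo_lim, hi_lim = values.min() - 5, values.max() + 5
colors = plt.cm.inferno(np.linspace(0.1, 0.9, len(labels)))
for lo, hi, label, color in zip(edges[:-1], edges[1:], labels, colors):
    ax.axhspan(max(lo, lo_lim), min(hi, hi_lim), color=color, alpha=0.4, label=label)
ax.set_ylim(lo_lim, hi_lim)
ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
fig.tight_layout()
```

The bands make the class of each point readable from its height alone, at the cost of not coloring individual x positions the way the barplot background does.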
