I have created the following dataframe based on a range of data.
# each range needs both bounds to hold, so the bounds are joined with `and`
df['data_classification'] = df.myDatarange.apply(lambda a: 'Very good' if a >= -90
    else ('Good' if (a >= -100 and a <= -91)
    else ('Moderate' if (a >= -110 and a <= -101)
    else ('Poor' if (a >= -123 and a <= -111)
    else ('Bad' if (a >= -140 and a <= -124)
    else 'Off')))))
I am planning to plot myDatarange together with data_classification and show the relation between them using different colours. I am very confused about how to plot this.
I can plot myDatarange as a single line plot, but how do I relate the two columns?
So far, I have tried the following:
x1 = df1.index
y1 = df1.myDatarange
f, (ax1,ax2) = plt.subplots(2,figsize=(5, 5))
ax1.plot(x1,y1,color='red', linewidth=1.9, alpha=0.9, label="myDataRange")
plt.show()
How can I plot the above range of data, based on the classification, as an area plot? Is there a better way than an area plot to express my data? There are examples on the net, but they are not very clear on the conditional side of it.
Seaborn's barplot can take a hue parameter to color each bar according to its 'data_classification'. The 'data_classification' column can be created more quickly, and is easier to modify, via pd.cut.
The barplot can be used as a background for the lineplot to show the classification of each value.
Here is an example to get you started:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'myDatarange': np.random.randint(-150, -50, size=50)})
ranges = [-10**6, -140, -123, -110, -100, -90, 10**6]
df['data_classification'] = pd.cut(df['myDatarange'], ranges, right=False,
                                   labels=['Off', 'Bad', 'Poor', 'Moderate', 'Good', 'Very Good'])
fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(df.index, df['myDatarange'], color='blue', linewidth=2, alpha=0.9, label="myDataRange")
sns.barplot(x=df.index, y=[df['myDatarange'].min()] * len(df),
            hue='data_classification', alpha=0.5, palette='inferno', dodge=False, data=df, ax=ax1)
# optionally set the bars to fill the complete background; by default seaborn sets the width to about 80%
for bar in ax1.patches:
    bar.set_width(1)
plt.legend(bbox_to_anchor=(1.02, 1.05) , loc='upper left')
plt.tight_layout()
plt.show()
PS: If you want the 0 at the bottom (it is now at the top because the y-values are negative), you could call ax1.invert_yaxis().
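To see how the pd.cut call above treats values that sit exactly on a bin edge, a minimal sketch along these lines may help. The `edge_values` series is made up for illustration; the bin edges and labels are the ones from the example above. With right=False each bin includes its left edge and excludes its right edge, so -100 lands in 'Good' and -90 in 'Very Good'.

```python
import pandas as pd

# Bin edges and labels copied from the example above
ranges = [-10**6, -140, -123, -110, -100, -90, 10**6]
labels = ['Off', 'Bad', 'Poor', 'Moderate', 'Good', 'Very Good']

# Hypothetical test values chosen to sit on or next to each bin edge
edge_values = pd.Series([-141, -140, -124, -123, -111, -110, -101, -100, -91, -90])

# right=False makes every bin closed on the left and open on the right
classified = pd.cut(edge_values, ranges, right=False, labels=labels)

print(pd.DataFrame({'value': edge_values, 'class': classified}))
```

Changing a threshold then only means editing one number in `ranges`, which is the main practical advantage over the nested lambda.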
I have the following problem: I'm trying to overlay two plots: one Pandas plot via plot.area() for a dataframe, and a second plot that is a standard Matplotlib plot. Depending on the order of the code for those two, the Matplotlib plot is displayed only if its code comes before the Pandas plot.area() on the same axes.
Example: I have a Pandas dataframe called revenue that has a DatetimeIndex and a single column with "revenue" values (float). Separately, I have a dataset called projection with data along the same index (revenue.index).
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlaid properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain to me what I'm doing wrong, and what I need to do to make the code more robust? Thanks in advance.
The values on the x-axis are different in both plots. I think DataFrame.plot.area() formats the DatetimeIndex in a pretty way, which is not compatible with pyplot.plot().
If you plot the projection first, plot.area() can still plot the data and does not format the x-axis.
Mixing the two seems tricky to me, so I would use either pyplot or DataFrame.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)
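If you still want to mix the two in the original order (pandas area first, matplotlib line second), one thing that may be worth trying is the x_compat=True keyword, which asks pandas to skip its own time-series handling of the x-axis. This is a sketch under assumptions, not a confirmed diagnosis of the original problem: the data below is made up, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with a regular monthly DatetimeIndex
index = pd.date_range("2021-12-01", periods=4, freq="MS")
revenue = pd.DataFrame({"revenue": [1200.0, 2200.0, 2800.0, 4100.0]}, index=index)
projection = [1000, 2000, 3000, 4000]

fig, ax = plt.subplots(figsize=(10, 6))

# First: pandas area plot, with pandas' time-series x-axis handling switched off
revenue.plot.area(ax=ax, x_compat=True)

# Second: plain matplotlib line on the same axes
ax.plot(revenue.index, projection, color="black", linewidth=3)

# Both artists should now share matplotlib's date units on the x-axis
x_low, x_high = ax.get_xlim()
first_date = mdates.date2num(revenue.index[0].to_pydatetime())
print(x_low, first_date, x_high)

plt.tight_layout()
fig.savefig("overlay.png")
```

If the printed first date falls between the two axis limits, the line and the area are using the same x-units and both should be visible.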
I have a dataset with a lot of categorical variables and a binary target variable. What package is available in Python or other opensource GUI-based software where I can scatterplot two categorical variables on the X and Y axis and use the target variable as hue?
I have looked at Seaborn's catplot, but for that, one axis has to be numerical while the other categorical. So it doesn't serve this case.
For example, you can use the following:
import seaborn as sns
data = sns.load_dataset('titanic')
Here are the plot features I want
X-axis - 'embark_town'
Y-axis - 'class'
hue - 'alive'
I am of the opinion that if you have to rearrange a seaborn graph substantially, you can also create this graph from scratch with matplotlib. This gives us the opportunity to have a different approach to display this categorical vs categorical plot:
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
import numpy as np
#dataframe and categories
import seaborn as sns
df = sns.load_dataset('titanic')
X = "embark_town"
Y = "class"
H = "alive"
bin_dic = {0: "yes", 1: "no"}
#counting the X-Y-H category entries
plt_df = df.groupby([X, Y, H]).size().to_frame(name="vals").reset_index()
#figure preparation with grid and scaling
fig, ax = plt.subplots(figsize=(9, 6))
ax.set_ylim(plt_df[Y].unique().size-0.5, -0.5)
ax.set_xlim(-0.5, plt_df[X].unique().size+1.0)
ax.grid(ls="--")
#upscale factor for scatter marker size
scale=10000/plt_df.vals.max()
#left marker for category 0
ax.scatter(plt_df[plt_df[H]==bin_dic[0]][X],
plt_df[plt_df[H]==bin_dic[0]][Y],
s=plt_df[plt_df[H]==bin_dic[0]].vals*scale,
c=[(0, 0, 1, 0.5)], edgecolor="black", marker=MarkerStyle("o", fillstyle="left"),
label=bin_dic[0])
#right marker for category 1
ax.scatter(plt_df[plt_df[H]==bin_dic[1]][X],
plt_df[plt_df[H]==bin_dic[1]][Y],
s=plt_df[plt_df[H]==bin_dic[1]].vals*scale,
c=[(1, 0, 0, 0.5)], edgecolor="black", marker=MarkerStyle("o", fillstyle="right"),
label=bin_dic[1])
#legend entries for the two categories
l = ax.legend(title="Survived the catastrophe", ncol=2, framealpha=0, loc="upper right", columnspacing=0.1,labelspacing=1.5)
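#matplotlib 3.7 renamed Legend.legendHandles to legend_handles; support both
legend_markers = l.legend_handles if hasattr(l, "legend_handles") else l.legendHandles
legend_markers[0]._sizes = legend_markers[1]._sizes = [800]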
#legend entries representing sizes
bubbles_n=5
bubbles_min = 50*(1+plt_df.vals.min()//50)
bubbles_step = 10*((plt_df.vals.max()-bubbles_min)//(10*(bubbles_n-1)))
bubbles_x = plt_df[X].unique().size+0.5
for i, bubbles_y in enumerate(np.linspace(0.5, plt_df[Y].unique().size-1, bubbles_n)):
    #plot each legend bubble to indicate different marker sizes
    ax.scatter(bubbles_x,
               bubbles_y,
               s=(bubbles_min + i*bubbles_step) * scale,
               c=[(1, 0, 1, 0.6)], edgecolor="black")
    #and label it with a value
    ax.annotate(bubbles_min+i*bubbles_step, xy=(bubbles_x, bubbles_y),
                ha="center", va="center",
                fontsize="large", fontweight="bold", color="white")
plt.show()
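The counting step in the code above (groupby on the three categories followed by size) can be tried in isolation on a tiny frame. The five rows below are invented for illustration and are not taken from the titanic dataset; the column names and the scale formula follow the code above.

```python
import pandas as pd

# Invented toy data using the same column names as the titanic example
df = pd.DataFrame({
    "embark_town": ["Southampton", "Southampton", "Cherbourg", "Cherbourg", "Southampton"],
    "class": ["First", "First", "Third", "Third", "Third"],
    "alive": ["yes", "no", "yes", "yes", "no"],
})

# One row per observed (town, class, alive) triplet, with its count in "vals"
counts = (df.groupby(["embark_town", "class", "alive"])
            .size()
            .to_frame(name="vals")
            .reset_index())

# Same upscale factor as above: the largest group gets marker size 10000
scale = 10000 / counts["vals"].max()

print(counts)
print(scale)
```

Note that the real titanic 'class' column is categorical, so groupby may also emit triplets with a count of zero there; the plain string columns in this toy frame do not show that behaviour.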
Seaborn, just like matplotlib, supports plotting categorical vs categorical variables. One can create semitransparent markers that let you see both categories, although two markers of similar size can be hard to tell apart from a single one. The essential plot is rather easy: we transform the dataframe with groupby and size to count the entries per triplet of embarking town, class, and alive category, then create a scatterplot with the count value as marker size. However, the legend is the complicated part here. Either the marker size is tiny in the plot or massive in the legend. I tried to balance this, but I am not happy with the result. A lot of manual adjusting is necessary, so seaborn offers no real advantage here. Any suggestions on how to simplify this within seaborn are welcome.
import seaborn as sns
import matplotlib.pyplot as plt
#dataframe and categories
df = sns.load_dataset('titanic')
X = "embark_town"
Y = "class"
H = "alive"
#counting the X-Y-H category entries
plt_df = df.groupby([X, Y, H]).size().to_frame(name="people").reset_index()
#figure preparation with grid and scaling
fig, ax = plt.subplots(figsize=(6,4))
ax.set_ylim(plt_df[Y].unique().size-0.5, -0.5)
ax.set_xlim(-0.5, plt_df[X].unique().size+1.0)
ax.grid(ls="--")
#the actual scatterplot with markersize representing the counted values
sns.scatterplot(x=X,
y=Y,
size="people",
sizes=(100, 10000),
alpha=0.5,
edgecolor="black",
hue=H,
data=plt_df,
ax=ax)
#creating two legends because the hue markers differ in size from the others
handles, labels = ax.get_legend_handles_labels()
l = ax.legend(handles[:3], labels[:3], title="The poor die first", markerscale=2, loc="upper right")
ax.add_artist(l)
#and seaborn plots the size markers in black, so you would get massive black blobs in the legend
#we change the color and make them transparent
for handle in handles:
    handle.set_facecolors((0, 1, 1, 0.5))
ax.legend(handles[4::2], labels[4::2], title="N° of people", loc="lower right", handletextpad=4, labelspacing=3, markerfirst=False)
plt.tight_layout()
plt.show()
Sample output:
I have a dataframe (df) with two columns: 'Foundation Type', which has 4 types of foundations (Shafts, Piles, Combination, Spread), and 'Vs30', which holds the values of the parameter Vs30. Each row represents a bridge, with a type of foundation and a Vs30 value.
First, I create a new column 'binVs30' in df by assigning each element of 'Vs30' to one of 5 bins ([0-200], [200-400], ..., [800-1000]).
df['binVs30'] = pd.cut(df.Vs30, bins=np.arange(0, 1100, 200))
Then I created a stacked area plot with the following code:
color_table = pd.crosstab(df['binVs30'], df['Foundation Type'], dropna=False)
ax = color_table.plot(kind='area', figsize=(8, 8), stacked=True, rot=0)
display(ax)
plt.xlabel('')
plt.ylabel('Frequency', fontsize=12)
plt.legend(title='Foundation Type', loc='upper right')
plt.title('Column Database', fontsize='20')
plt.show()
The resulting picture shows some extra bins that shouldn't be there. Therefore, I had to fix the xticks by manually adding the following code:
locs, labels = plt.xticks()
plt.xticks(locs, ['','0-200','','200-400','','400-600','','600-800','','800-1000'], fontsize=10, rotation=45)
Is there a reason why Python creates those extra bins that shouldn't exist? Is it a bug in Python? If I change it to a stacked bar plot, the problem vanishes. Is there a way to fix it without manually adding the bin labels?
I also have two other questions. First, how do I add an edgecolor to an area plot? Something like:
color_table.plot(kind='area', figsize=(8, 8), stacked=True, edgecolor='black', legend=None, rot=0)
The argument edgecolor='black' doesn't work in a stacked area plot.
Second, I would like to create bins for 'Vs30' like ([0-200], [200-400], ..., [>800]). Is there a way to do that? The way I create the 'binVs30' column doesn't allow me to create a '>800' bin.
There are a couple of questions here. First, about including an open-ended bin in your pd.cut(): you can use np.inf as the last edge to capture everything in the last bin and assign it a custom label. Second, since you're already using matplotlib, I'd recommend using its stackplot directly rather than via pandas. Then you can use the edgecolor argument without any issues.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame(data={
"foundation" : np.random.choice(list("ABCD"), 1000),
"binVs30" : np.random.randint(0, 1200, 1000)
})
bins = [0, 200, 400, 600, 800, np.inf]
labels = ["0-199", "200-399", "400-599", "600-799", "800+"]
df["bins"] = pd.cut(
df["binVs30"], bins=bins, labels=labels,
right=False, include_lowest=True)
stack_data = pd.crosstab(df['bins'], df['foundation'], dropna=False)
stack_array = stack_data.values.T.tolist()
pal = sns.color_palette("Set1")
plt.figure(figsize=(8,4))
plt.stackplot(
labels, stack_array, labels=list("ABCD"),
colors=pal, alpha=0.4, edgecolor="black")
plt.legend(loc='upper left')
plt.show()
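To check what the open-ended bin and the crosstab produce before plotting, here is a small sketch with invented values. The names `Vs30` and `foundation` and the eight rows are assumptions for illustration; the bins and labels are the ones used above. It also shows the layout plt.stackplot expects: one inner list per stacked series, with one entry per bin.

```python
import numpy as np
import pandas as pd

# Invented sample: eight bridges with a Vs30 value and a foundation type
df = pd.DataFrame({
    "Vs30": [50, 250, 450, 650, 900, 1500, 150, 350],
    "foundation": ["A", "B", "A", "B", "A", "B", "B", "A"],
})

# np.inf as the last edge makes the final bin open-ended
bins = [0, 200, 400, 600, 800, np.inf]
labels = ["0-199", "200-399", "400-599", "600-799", "800+"]
df["bins"] = pd.cut(df["Vs30"], bins=bins, labels=labels, right=False)

# Rows are bins, columns are foundation types
stack_data = pd.crosstab(df["bins"], df["foundation"])

# Transpose so that each inner list is one foundation type across all bins
stack_array = stack_data.values.T.tolist()

print(stack_data)
print(stack_array)
```

Here both 900 and 1500 end up in the "800+" bin, and `stack_array` has one row for foundation A and one for B.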
I am having an issue trying to superimpose plots with seaborn. I am able to generate the two plots separately as
fig, (ax1,ax2) = plt.subplots(ncols=2,figsize=(30, 7))
sns.lineplot(data=data1, y='MSE',x='pct_gc',ax=ax1)
sns.boxplot(x="pct_gc", y="MSE", data=data2,ax=ax2,width=0.4)
The output looks like this:
However, the problem appears when I try to superimpose both plots by assigning both to the same ax object:
fig, ax1 = plt.subplots(figsize=(30, 7))
sns.lineplot(data=data1, y='MSE',x='pct_gc',ax=ax1)
sns.boxplot(x="pct_gc", y="MSE", data=data2,ax=ax1,width=0.4)
I am not able to work out why the X axis of the lineplot changes when superimposing both plots (the X axis of both plots goes from 0 to 0.069).
My goal is for both plots to be superimposed, while keeping the same X axis range.
Seaborn's boxplot creates a categorical x-axis, with all boxes evenly spaced. Internally the x-axis is numbered 0, 1, 2, ... but externally it gets the labels from 0 to 0.069.
To combine a line plot with a boxplot, matplotlib's boxplot can be called directly, so that positions and widths can be set explicitly. With patch_artist=True, a rectangle is created (instead of just lines), for which a facecolor can be given. manage_ticks=False prevents boxplot from changing the x ticks and their limits. Optionally, notch=True would accentuate the median a bit more, but depending on the data, the confidence interval might be too large and look weird.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
data1 = pd.DataFrame({'pct_gc': np.linspace(0, 0.069, 200), 'MSE': np.random.normal(0.02, 0.1, 200).cumsum()})
data1['pct_range'] = pd.cut(data1['pct_gc'], 10)
fig, ax1 = plt.subplots(ncols=1, figsize=(20, 7))
sns.lineplot(data=data1, y='MSE', x='pct_gc', ax=ax1)
for interval, color in zip(np.unique(data1['pct_range']), plt.cm.tab10.colors):
    ax1.boxplot(data1[data1['pct_range'] == interval]['MSE'],
                positions=[interval.mid], widths=0.4 * interval.length,
                patch_artist=True, boxprops={'facecolor': color},
                notch=False, medianprops={'color':'yellow', 'linewidth':2},
                manage_ticks=False)
plt.show()
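As a variation on the loop above, all boxes can also be drawn with a single boxplot call by passing lists of positions and widths. This is only a sketch with randomly generated data: the seed, the number of bins and the Agg backend are my own assumptions. It also prints the x-limits to confirm that the axis stays in data units instead of switching to categorical positions 0, 1, 2, ...

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({"pct_gc": np.linspace(0, 0.069, 100),
                     "MSE": rng.normal(0.02, 0.1, 100).cumsum()})
data["pct_range"] = pd.cut(data["pct_gc"], 5)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(data["pct_gc"], data["MSE"])

# One (interval, sub-frame) pair per bin that actually contains data
groups = list(data.groupby("pct_range", observed=True))

# Boxes are placed at the middle of each interval, in data coordinates
result = ax.boxplot(
    [group["MSE"].to_numpy() for _, group in groups],
    positions=[interval.mid for interval, _ in groups],
    widths=[0.4 * interval.length for interval, _ in groups],
    manage_ticks=False,  # leave the ticks and limits of the line plot alone
)

x_low, x_high = ax.get_xlim()
print(x_low, x_high)
fig.savefig("line_and_boxes.png")
```

The single call is slightly less flexible than the loop, since all boxes then share the same styling unless the returned artists are recolored afterwards.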
I created a Seaborn barplot using the code below (it comes from https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/).
I would like all the bars to stack up without whitespace, but have been unable to do so. If I add width, it complains about multiple values for width in barh. This is probably because seaborn has its own algorithm to determine the width. Is there any way around it?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
# Draw Plot
plt.figure(figsize=(13, 10), dpi=80)
group_col = 'Gender'
order_of_bars = df.Stage.unique()[::-1]
colors = [plt.cm.Spectral(i/float(len(df[group_col].unique())-1)) for i in
          range(len(df[group_col].unique()))]
for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group, :],
                order=order_of_bars, color=c, label=group)
# Decorations
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
plt.legend()
plt.show()
Not a matplotlib expert by any means, so there may be a better way to do this. Perhaps you can do something like the following, which is similar to the approach in this answer:
# Draw Plot
fig, ax = plt.subplots(figsize=(13, 10), dpi=80)
...
for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group, :],
                order=order_of_bars, color=c, label=group, ax=ax)
# Adjust height
for patch in ax.patches:
    current_height = patch.get_height()
    patch.set_height(1)
    patch.set_y(patch.get_y() + current_height - 1)
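The patch adjustment can be tried on a plain matplotlib barh without seaborn or the CSV file. The sketch below uses invented values and a slightly different shift from the code above: it moves each bar by half of the height difference, so that every bar stays centered on its tick. That centering is my own choice, not part of the original answer, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))

# Invented funnel-like data; barh uses a default bar height of 0.8
ax.barh(["Stage 1", "Stage 2", "Stage 3"], [30, 50, 20])

centers_before = [p.get_y() + p.get_height() / 2 for p in ax.patches]

for patch in ax.patches:
    current_height = patch.get_height()
    # Grow the bar to height 1 and shift it down by half the added height,
    # so that its center (and therefore its tick) does not move
    patch.set_y(patch.get_y() - (1 - current_height) / 2)
    patch.set_height(1)

centers_after = [p.get_y() + p.get_height() / 2 for p in ax.patches]
heights = [p.get_height() for p in ax.patches]
print(heights, centers_after)
fig.savefig("bars_without_gaps.png")
```

With a height of 1 on an axis whose categories are one unit apart, neighbouring bars touch, which is what removes the whitespace.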