I have a dataframe where each column contains values considered "normal" if they fall within an interval, which is different for every column:
# The main df
df = pd.DataFrame({"A": [20, 10, 7, 39],
"B": [1, 8, 12, 9],
"C": [780, 800, 1200, 250]})
The df_info represents the intervals for each column of df.
So for example df_info["A"][0] is the min for the column df["A"] and df_info["A"][1] represents the max for the column df["A"] and so on.
df_info = pd.DataFrame({"A": [22, 35],
"B": [5, 10],
"C": [850, 900]})
Thanks to this SO answer, I was able to create a custom heatmap that prints values below the range in blue, values above the range in red, and values within the range in white. Remember that each column has a different range, so I normalized as follows:
df_norm = pd.DataFrame()
for col in df:
    col_min = df_info[col][0]
    col_max = df_info[col][1]
    df_norm[col] = (df[col] - col_min) / (col_max - col_min)
Finally, I plotted my heatmap:
vmin = df_norm.min().min()
vmax = df_norm.max().max()
norm_zero = (0 - vmin) / (vmax - vmin)
norm_one = (1 - vmin) / (vmax - vmin)
colors = [[0, 'darkblue'],
[norm_zero, 'white'],
[norm_one, 'white'],
[1, 'darkred']
]
cmap = LinearSegmentedColormap.from_list('', colors, )
fig, ax = plt.subplots()
# plot the normalized values; `data` was undefined in the original snippet
ax = sns.heatmap(data=df_norm,
                 annot=True,
                 annot_kws={'size': 'large'},
                 mask=None,
                 cmap=cmap,
                 vmin=vmin,
                 vmax=vmax,
                 ax=ax)
ax.set_facecolor('white')
In the example you can see that the third column has values much higher/lower than the 0-1 interval (and than the first column), so they "absorb" all the shades of red and blue.
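The effect can be checked numerically. The following is a minimal sketch using the same df and df_info as above; the vectorized normalization is assumed to be equivalent to the loop, and col_range is a name introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [20, 10, 7, 39],
                   "B": [1, 8, 12, 9],
                   "C": [780, 800, 1200, 250]})
df_info = pd.DataFrame({"A": [22, 35],
                        "B": [5, 10],
                        "C": [850, 900]})

# Same normalization as the loop: 0 is the column's min limit, 1 its max limit.
df_norm = (df - df_info.iloc[0]) / (df_info.iloc[1] - df_info.iloc[0])

# Per-column extremes of the normalized values (col_range is an illustrative name).
col_range = df_norm.agg(["min", "max"])
print(col_range)

# The global limits that feed the shared colormap.
vmin = df_norm.min().min()
vmax = df_norm.max().max()
print(vmin, vmax)
```

Here both global limits come from column C (-12 and 7), while column A only spans roughly -1.15 to 1.31, which is why A barely moves away from white.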
QUESTION:
What I want is to use the full range of red/blue shades for each column, or at least to reduce the perceptual difference between (for example) the first and third columns.
I had thought of:
create a custom colormap where the normalization is performed per column
use multiple colormaps, each one applied to a different column
apply a normalization such as mpl.colors.LogNorm, but I'm not sure how to use it with my custom LinearSegmentedColormap
Using a mask per column, you could draw the heatmap column per column, each with its own colormap:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.cm import ScalarMappable
df = pd.DataFrame({"A": [20, 10, 7, 39],
"B": [1, 8, 12, 9],
"C": [780, 800, 1200, 250]})
df_info = pd.DataFrame({"A": [22, 35],
"B": [5, 10],
"C": [850, 900]})
df_norm = pd.DataFrame()
for col in df:
    col_min = df_info[col][0]
    col_max = df_info[col][1]
    df_norm[col] = (df[col] - col_min) / (col_max - col_min)
fig, ax = plt.subplots()
for col in df:
    # per-column limits, so each column uses the full blue/red range
    vmin = df_norm[col].min()
    vmax = df_norm[col].max()
    norm_zero = (0 - vmin) / (vmax - vmin)
    norm_one = (1 - vmin) / (vmax - vmin)
    colors = [[0, 'darkblue'],
              [norm_zero, 'white'],
              [norm_one, 'white'],
              [1, 'darkred']]
    cmap = LinearSegmentedColormap.from_list('', colors)
    # mask every column except the current one
    mask = df.copy()
    for col_m in mask:
        mask[col_m] = col != col_m
    sns.heatmap(data=df_norm,
                annot=df.to_numpy(), annot_kws={'size': 'large'}, fmt="g",
                mask=mask,
                cmap=cmap, vmin=vmin, vmax=vmax, cbar=False, ax=ax)
ax.set_facecolor('white')
colors = [[0, 'darkblue'],
[1 / 3, 'white'],
[2 / 3, 'white'],
[1, 'darkred']]
cmap = LinearSegmentedColormap.from_list('', colors)
cbar = plt.colorbar(ScalarMappable(cmap=cmap), ax=ax, ticks=[0, 1 / 3, 2 / 3, 1])
cbar.ax.yaxis.set_ticklabels(['min\nlimit', 'min', 'max', 'max\nlimit'])
plt.tight_layout()
plt.show()
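A variation that needs only one heatmap call is to rescale each column piecewise before plotting. This is a hedged sketch, not part of the answer above: the helper name scale_per_column is hypothetical, and it assumes that values inside the interval should all map to 0 (white), the lowest value below the interval to -1, and the highest value above it to +1.

```python
import pandas as pd

df = pd.DataFrame({"A": [20, 10, 7, 39],
                   "B": [1, 8, 12, 9],
                   "C": [780, 800, 1200, 250]})
df_info = pd.DataFrame({"A": [22, 35],
                        "B": [5, 10],
                        "C": [850, 900]})


def scale_per_column(df, df_info):
    """Map each column to [-1, 1]: 0 inside its interval, -1/+1 at its own extremes."""
    scaled = pd.DataFrame(0.0, index=df.index, columns=df.columns)
    for col in df:
        lo, hi = df_info[col].iloc[0], df_info[col].iloc[1]
        values = df[col].astype(float)
        below = values < lo
        above = values > hi
        # Only divide when there is at least one out-of-range value,
        # which guarantees a non-zero denominator.
        if below.any():
            scaled.loc[below, col] = (values[below] - lo) / (lo - values.min())
        if above.any():
            scaled.loc[above, col] = (values[above] - hi) / (values.max() - hi)
    return scaled


scaled = scale_per_column(df, df_info)
print(scaled)
```

The result could then be drawn with a single diverging colormap, for example sns.heatmap(scaled, annot=df, fmt="g", cmap="RdBu_r", vmin=-1, vmax=1). The trade-off is that colors are no longer comparable across columns, only within one.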
You can re-scale your df_norm before plotting:
# alternative method to scale
df_norm = (df - df_info.iloc[0])/(df_info.iloc[1]-df_info.iloc[0])
# scale the norm
df_plot = (df_norm - df_norm.min())/(df_norm.max()-df_norm.min())
# heat map on the normalized `df_plot`
# use values in `df_norm` to annotate
# color bar doesn't make sense so we remove it
sns.heatmap(df_plot, annot=df_norm, cmap='RdBu_r', cbar=False)
I have the following dataframe:
d = {'a': [2, 3, 4.5], 'b': [3, 2, 5]}
df = pd.DataFrame(data=d, index=["val1", "val2","val3"])
df.head()
a b
val1 2.0 3
val2 3.0 2
val3 4.5 5
I plotted this dataframe with the following code:
fig, ax=plt.subplots(figsize=(10,10))
ax.scatter(df["a"], df["b"],s=1)
x1=[0, 2512]
y1=[0, 2512]
ax.plot(x1,y1, 'r-')
#set limits:
ax = plt.gca()
ax.set_xlim([0, 10])
ax.set_ylim([0, 10])
#add labels:
TEXTS = []
for idx, names in enumerate(df.index.values):
    x, y = df["a"].iloc[idx], df["b"].iloc[idx]
    TEXTS.append(ax.text(x, y, names, fontsize=12))
# Adjust text position and add lines
adjust_text(
TEXTS,
expand_points=(2.5, 2.5),
expand_text=(2.5,2),
autoalign="xy",
arrowprops=dict(arrowstyle="-", lw=1),
ax=ax
);
However, I can not find a way to push the labels away from the red diagonal line, in order to get this result:
You can use the regular matplotlib annotate function and change the direction of the offset depending on the position of the data point relative to the red line:
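import numpy as np

ax = df.plot.scatter('a', 'b')
ax.set_aspect(1)
ax.plot((0,10), (0,10), 'r-')
# offset perpendicular to the diagonal, pointing up and to the left
offset = np.array([-1, 1])
for s, xy in df.iterrows():
    xy = xy.to_numpy()
    # points above the line are pushed up-left, the others down-right
    direction = 1 if xy[1] > xy[0] else -1
    ax.annotate(s, xy, xy + direction * offset, ha='center', va='center', arrowprops=dict(arrowstyle='-', lw=1))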
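To see what the direction test does without drawing anything, the label positions can be computed on their own. This is a small sketch on the question's data; label_xy is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2, 3, 4.5], "b": [3, 2, 5]},
                  index=["val1", "val2", "val3"])

# Perpendicular to the diagonal y = x, pointing up and to the left.
offset = np.array([-1, 1])

# +1 for points above the diagonal (b > a), -1 for points on or below it.
direction = np.where(df["b"] > df["a"], 1, -1)

# Each label is shifted away from the diagonal, on the point's own side.
label_xy = df[["a", "b"]].to_numpy() + direction[:, None] * offset
for name, pos in zip(df.index, label_xy):
    print(name, pos)
```

A larger or smaller offset vector changes how far the labels sit from the line; the sign logic stays the same.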
I started learning how to plot data in Python and I need help achieving the following:
I have the following example df6:
df6 = pd.DataFrame({
'emails': [50, 60 ,30, 40, 90, 10, 0,85 ],
'delivered': [20, 16 ,6, 15, 66, 6, 0,55 ]
})
df6
Looks like:
emails delivered
0 50 20
1 60 16
2 30 6
3 40 15
4 90 66
5 10 6
6 0 0
7 85 55
I need to plot emails vs. delivered in a 4-quadrant chart. The X and Y ranges should extend slightly beyond the max, and the crossing point should be at the means of both columns.
What I did so far: I used describe() to get the statistics of df6, then:
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.axhline(y=45.6, color="black", linestyle="--")
plt.axvline(x=23, color="black", linestyle="--")
plt.plot(df6['delivered'],df6['emails'],"o")
plt.xlim([0, df6['delivered'].max()+20])
plt.ylim([0, df6['emails'].max()+20])
plt.show()
I got the following output so far:
What I am looking for is a chart divided into just 4 groups of scattered points, with each group labeled with the total count of points in that quadrant:
I found it easier to normalize the data before plotting... UPDATE: Messed something up with counts, but the code is here to analyze my mistake.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scale = scaler.fit(df6)
# standardize df6 (zero mean, unit variance per column)
norm_df = pd.DataFrame(scale.transform(df6), columns=df6.columns)
quadrant_1 = sum(np.logical_and(norm_df['emails'] < 0, norm_df['delivered'] < 0))
display(quadrant_1)
quadrant_2 = sum(np.logical_and(norm_df['emails'] > 0, norm_df['delivered'] < 0))
display(quadrant_2)
quadrant_3 = sum(np.logical_and(norm_df['emails'] < 0, norm_df['delivered'] > 0))
display(quadrant_3)
quadrant_4 = sum(np.logical_and(norm_df['emails'] > 0, norm_df['delivered'] > 0))
display(quadrant_4)
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.axhline(y=0, color="black", linestyle="--")
plt.axvline(x=0, color="black", linestyle="--")
plt.plot(norm_df['delivered'],norm_df['emails'],"o")
plt.gca().spines['bottom'].set_visible(False)
plt.gca().spines['left'].set_visible(False)
plt.gca().axes.get_xaxis().set_visible(False)
plt.gca().axes.get_yaxis().set_visible(False)
plt.text(0,-2.1,'Delivered',horizontalalignment='center', verticalalignment='center')
plt.text(-2.1,0,'Emails', horizontalalignment='center', verticalalignment='center', rotation=90)
plt.text(1,1,'Count: ' + str(quadrant_1),horizontalalignment='center', verticalalignment='center')
plt.text(-1,1,'Count: ' + str(quadrant_2), horizontalalignment='center', verticalalignment='center')
plt.text(-1,-1,'Count: ' + str(quadrant_3),horizontalalignment='center', verticalalignment='center')
plt.text(1,-1,'Count: ' + str(quadrant_4), horizontalalignment='center', verticalalignment='center')
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.show()
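One possible source of the miscount mentioned in the update: the plot puts delivered on the x axis and emails on the y axis, yet the label at (1, 1) shows quadrant_1, which counts points where both values are below the mean. The following is a minimal sketch that ties each count to its label position. The names quadrant_counts, left, right, upper and lower are illustrative, and it compares against the raw column means, which gives the same signs as the standardized values.

```python
import pandas as pd

df6 = pd.DataFrame({"emails": [50, 60, 30, 40, 90, 10, 0, 85],
                    "delivered": [20, 16, 6, 15, 66, 6, 0, 55]})

# x axis = delivered, y axis = emails, as in the plot above.
right = df6["delivered"] > df6["delivered"].mean()
left = df6["delivered"] < df6["delivered"].mean()
upper = df6["emails"] > df6["emails"].mean()
lower = df6["emails"] < df6["emails"].mean()

# Keys are the (x, y) positions where plt.text places each label.
quadrant_counts = {
    (1, 1): int((right & upper).sum()),
    (-1, 1): int((left & upper).sum()),
    (-1, -1): int((left & lower).sum()),
    (1, -1): int((right & lower).sum()),
}
print(quadrant_counts)
```

Looping over quadrant_counts.items() and calling plt.text(x, y, f"Count: {n}") for each entry would then place every count in the quadrant it describes.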
So to use the means in your plots you can start by simply modifying these 2 lines:
plt.axhline(y=df6['emails'].mean(), color="black", linestyle="--")
plt.axvline(x=df6['delivered'].mean(), color="black", linestyle="--")
We can then use DataFrame.value_counts to compute the counts:
counts = df6.transform(lambda s: s >= s.mean()).value_counts()
pos = df6.agg(['min', 'max'])
Here counts contains the values of each pair of above/below means:
emails  delivered
False   False        4
True    False        2
        True         2
and pos contains the x/y (or email/delivered) coordinates at which the boxes are placed:
emails delivered
min 0 0
max 90 66
So you can adjust pos to change the annotation placement.
Finally you want to do the annotation on the figure:
# Series.iteritems() was removed in pandas 2.0; items() works in all versions
for (eml, dlv), num in counts.items():
    ax.text(s=f'count: {num}',
            x=pos.loc['max' if dlv else 'min', 'delivered'],
            y=pos.loc['max' if eml else 'min', 'emails'],
            ha='right' if dlv else 'left',
            va='top' if eml else 'bottom',
            )
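Note that value_counts only lists combinations that actually occur, so the quadrant with few emails but many delivered is missing from counts and gets no label. If a "count: 0" label is wanted there, one option (a sketch, with all_quadrants as an illustrative name) is to reindex onto all four combinations:

```python
import pandas as pd

df6 = pd.DataFrame({"emails": [50, 60, 30, 40, 90, 10, 0, 85],
                    "delivered": [20, 16, 6, 15, 66, 6, 0, 55]})

# True where a value is at or above its column mean.
counts = df6.transform(lambda s: s >= s.mean()).value_counts()

# All four (emails, delivered) above/below combinations.
all_quadrants = pd.MultiIndex.from_product(
    [[False, True], [False, True]], names=["emails", "delivered"])

# Combinations that never occur get an explicit count of 0.
counts = counts.reindex(all_quadrants, fill_value=0)
print(counts)
```

The annotation loop can then run over this reindexed Series unchanged.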
You are just missing the code for setting the position of your left/bottom spines:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
df6 = pd.DataFrame({'emails': [50, 60 ,30, 40, 90, 10, 0,85 ],
'delivered': [20, 16 ,6, 15, 66, 6, 0,55 ]})
plt.plot(df6['delivered'],df6['emails'],"o")
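# count the points in the upper-left quadrant (few delivered, many emails),
# comparing each column against its own mean
count = np.count_nonzero(
    (df6['emails'] > df6['emails'].mean()) &
    (df6['delivered'] < df6['delivered'].mean()))
plt.annotate('count: %s' % count, (5, 60))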
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_position(('data',df6['delivered'].mean()))
plt.gca().spines['bottom'].set_position(('data',df6['emails'].mean()))
Here's another solution, with a more symmetric looking plot:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{
"emails": [50, 60, 30, 40, 90, 10, 0, 85],
"delivered": [20, 16, 6, 15, 66, 6, 0, 55],
}
)
plt.plot(df["delivered"], df["emails"], "o")
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_position(("data", df["delivered"].mean()))
plt.gca().spines["bottom"].set_position(("data", df["emails"].mean()))
def get_lims(df, column, w=0.1):
    # symmetric limits around the mean, padded by a fraction w
    mean = df[column].mean()
    max_diff = max(
        abs(df[column].max() - mean),
        abs(df[column].min() - mean),
    )
    return [mean - max_diff - max_diff * w, mean + max_diff + max_diff * w]
plt.xlim(get_lims(df, "delivered"))
plt.ylim(get_lims(df, "emails"))
plt.show()
How do I add conditional coloring to this table?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'A':[16, 15, 14, 16],
'B': [3, -2, 5, 0],
'C': [200000, 3, 6, 800000],
'D': [51, -6, 3, 2]})
fig, ax = plt.subplots(figsize=(10,5))
ax.axis('tight')
ax.axis('off')
the_table = ax.table(cellText = df.values, colLabels = df.columns, loc='center')
plt.show()
How do I add conditional coloring to the table so that cells in columns A and D are red when the value is greater than or equal to 15 and green otherwise, and cells in columns B and C are red when the value is greater than or equal to 5 and green otherwise? This is what it should look like:
Generate a list of lists and feed it to cellColours. Make sure that the list of lists contains as many lists as you have rows in the data frame and each of the lists within the list of lists contains as many strings as you have columns in the data frame.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'A':[16, 15, 14, 16],
'B': [3, -2, 5, 0],
'C': [200000, 3, 6, 800000],
'D': [51, -6, 3, 2]})
colors = []
for _, row in df.iterrows():
    # one color per column for this row; green unless a threshold is reached
    colors_in_column = ["g", "g", "g", "g"]
    if row["A"]>=15:
        colors_in_column[0] = "r"
    if row["B"]>=5:
        colors_in_column[1] = "r"
    if row["C"]>=5:
        colors_in_column[2] = "r"
    if row["D"]>=15:
        colors_in_column[3] = "r"
    colors.append(colors_in_column)
fig, ax = plt.subplots(figsize=(10,5))
ax.axis('tight')
ax.axis('off')
the_table = ax.table(cellText = df.values, colLabels = df.columns, loc='center', cellColours=colors)
plt.show()
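For larger tables, the same list of lists can be built without an explicit loop. This is a sketch under the thresholds stated in the question; thresholds and is_red are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [16, 15, 14, 16],
                   "B": [3, -2, 5, 0],
                   "C": [200000, 3, 6, 800000],
                   "D": [51, -6, 3, 2]})

# One threshold per column, taken from the question.
thresholds = pd.Series({"A": 15, "B": 5, "C": 5, "D": 15})

# Compare every column with its own threshold (aligned on column labels).
is_red = df.ge(thresholds, axis="columns")

# Same shape as the data: "r" where the threshold is reached, "g" elsewhere.
colors = np.where(is_red, "r", "g").tolist()
print(colors)
```

The resulting colors can be passed to ax.table(..., cellColours=colors) exactly as in the loop-based version.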
I have the following MWE, which plots two columns of a pandas dataframe in one single plot where each column has its own y-axis:
df = pd.DataFrame({'t': [2000, 2002, 2004, 2006],
'a': [2, 4, 6, 8],
'b': [100, 200, 300, 400]})
fig = plt.figure(figsize=(10, 10))
plt.xticks(np.arange(2000, 2020, 2))
ax1 = df['b'].plot(label="b")
ax1.set_ylabel("b")
ax1.set_ylim(0, 500)
ax2 = df['a'].plot(secondary_y=True, label="a")
ax2.set_ylabel("a")
ax2.set_ylim(0, 5)
handles, labels = [], []
for ax in fig.axes:
    for h, l in zip(*ax.get_legend_handles_labels()):
        handles.append(h)
        labels.append(l)
plt.legend(handles, labels)
However, the x-ticks are missing although I have tried to add them with this line of code: plt.xticks(np.arange(2000, 2020, 2)).
What command do I need to add them besides what I already have?
plt.xticks(np.arange(2000, 2020, 2)) sets the ticks to be at positions 2000, 2002, etc.
However, your plot ranges from 0 to 3, because that is the index of the dataframe.
Either set the index to the values of the "t" column,
df.set_index("t", inplace=True)
ax1 = df['b'].plot(label="b")
or plot the columns directly
ax1 = df.plot(x="t", y="b", label="b")
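The difference is easy to confirm by looking at the index that pandas uses for the x positions. A minimal sketch on the question's data (by_year is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"t": [2000, 2002, 2004, 2006],
                   "a": [2, 4, 6, 8],
                   "b": [100, 200, 300, 400]})

# Series.plot() uses the index as x values: here the default 0..3.
default_x = df["b"].index.tolist()
print(default_x)   # [0, 1, 2, 3]

# After set_index("t"), the x values are the years themselves.
by_year = df.set_index("t")
year_x = by_year["b"].index.tolist()
print(year_x)      # [2000, 2002, 2004, 2006]
```

With the years as x values, ticks at 2000, 2002, and so on fall inside the plotted range and become visible.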
You need to specify the limits of the x axis. The following solution can help.
df = pd.DataFrame({'t': [2000, 2002, 2004, 2006],
'a': [2, 4, 6, 8],
'b': [100, 200, 300, 400]})
fig = plt.figure(figsize=(10, 10))
plt.xticks(np.arange(2000, 2020, 2))
ax1 = df['b'].plot(label="b")
ax1.set_ylabel("b")
ax1.set_ylim(0, 500)
ax2 = df['a'].plot(secondary_y=True, label="a")
ax2.set_ylabel("a")
ax2.set_ylim(0, 5)
ax3 = df['t'].plot(label="t")
ax3.set_xlabel("t")
ax3.set_xlim(2000,2020)
handles, labels = [], []
for ax in fig.axes:
    for h, l in zip(*ax.get_legend_handles_labels()):
        handles.append(h)
        labels.append(l)
plt.legend(handles, labels)
When plotting 2 columns from a dataframe into a line plot, is it possible to, instead of a consistently increasing scale, have fixed values on your y axis (and keep the distances between the numbers on the axis constant)? For example, instead of 0, 100, 200, 300, ... to have 0, 21, 53, 124, 287, depending on the values from your dataset? So basically to have on the axis all your possible values fixed instead of an increasing scale?
Yes, you can use: ax.set_yticks()
Example:
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yticks(y)
plt.show()
Or, if the values are very far apart from each other, you can use ax.set_yscale('log').
Example:
df = pd.DataFrame([[13, 1], [14, 1.5], [15, 1.8], [16, 2], [17, 2], [18, 3 ], [19, 3.6], [20, 300]], columns = ['A','B'])
fig, ax = plt.subplots()
x = df['A']
y = df['B']
ax.plot(x, y, 'g-')
ax.set_yscale('log', base=2)  # matplotlib >= 3.3; older versions use basey=2
ax.yaxis.set_ticks(y)
ax.yaxis.set_ticklabels(y)
plt.show()
What you need to do is:
get all distinct y values and sort them
set their y position on the plot according to their place on the ordered list
set the y labels according to distinct ordered values
The code below does this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame([[13, 1], [14, 1.8], [16, 2], [15, 1.5], [17, 2], [18, 3 ],
[19, 200],[20, 3.6], ], columns = ['A','B'])
x = df['A']
y = df['B']
y_keys = np.sort(y.unique())
y_values = range(len(y_keys))
y_dict = dict(zip(y_keys,y_values))
fig, ax = plt.subplots()
ax.plot(x,[y_dict[k] for k in y],'o-')
ax.set_yticks(y_values)
ax.set_yticklabels(y_keys)
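The sort-and-lookup steps can also be done in one call. This is a sketch of an alternative, not the answer's own method: np.unique with return_inverse=True returns the sorted distinct values together with each original value's position in that sorted list (y_pos is an illustrative name).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[13, 1], [14, 1.8], [16, 2], [15, 1.5], [17, 2], [18, 3],
                   [19, 200], [20, 3.6]], columns=["A", "B"])

# y_keys: sorted distinct values; y_pos: rank of each original value in y_keys.
y_keys, y_pos = np.unique(df["B"].to_numpy(), return_inverse=True)
print(y_keys)
print(y_pos)
```

Plotting df["A"] against y_pos and then calling ax.set_yticks(range(len(y_keys))) and ax.set_yticklabels(y_keys) gives the same evenly spaced axis as the dictionary version.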