Two and three dimensional data can be viewed relatively straight-forwardly using traditional plot types. Even with four dimensional data, we can often find a way to display the data. Dimensions above four, though, become increasingly difficult to display. Fortunately, parallel coordinates plots provide a mechanism for viewing results with higher dimensions.
Several plotting packages provide parallel coordinates plots, such as Matlab, R, VTK type 1 and VTK type 2, but I don't see how to create one using Matplotlib.
Is there a built-in parallel coordinates plot in Matplotlib? I certainly don't see one in the gallery.
If there is no built-in type, is it possible to build a parallel coordinates plot using standard features of Matplotlib?
Edit:
Based on the answer provided by Zhenya below, I developed the following generalization that supports an arbitrary number of axes. Following the plot style of the example I posted in the original question above, each axis gets its own scale. I accomplished this by normalizing the data at each axis point and making the axes have a range of 0 to 1. I then go back and apply labels to each tick-mark that give the correct value at that intercept.
The function works by accepting an iterable of data sets. Each data set is considered a set of points where each point lies on a different axis. The example in __main__ grabs random numbers for each axis in two sets of 30 lines. The lines are random within ranges that cause clustering of lines; a behavior I wanted to verify.
This solution isn't as good as a built-in solution since you have odd mouse behavior and I'm faking the data ranges through labels, but until Matplotlib adds a built-in solution, it's acceptable.
#!/usr/bin/python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
def parallel_coordinates(data_sets, style=None):
    dims = len(data_sets[0])
    x = range(dims)
    fig, axes = plt.subplots(1, dims-1, sharey=False)
    if style is None:
        style = ['r-']*len(data_sets)
    # Calculate the limits on the data
    min_max_range = list()
    for m in zip(*data_sets):
        mn = min(m)
        mx = max(m)
        if mn == mx:
            mn -= 0.5
            mx = mn + 1.
        r = float(mx - mn)
        min_max_range.append((mn, mx, r))
    # Normalize the data sets
    norm_data_sets = list()
    for ds in data_sets:
        nds = [(value - min_max_range[dimension][0]) /
               min_max_range[dimension][2]
               for dimension,value in enumerate(ds)]
        norm_data_sets.append(nds)
    data_sets = norm_data_sets
    # Plot the datasets on all the subplots
    for i, ax in enumerate(axes):
        for dsi, d in enumerate(data_sets):
            ax.plot(x, d, style[dsi])
        ax.set_xlim([x[i], x[i+1]])
    # Set the x axis ticks
    for dimension, (axx,xx) in enumerate(zip(axes, x[:-1])):
        axx.xaxis.set_major_locator(ticker.FixedLocator([xx]))
        ticks = len(axx.get_yticklabels())
        labels = list()
        step = min_max_range[dimension][2] / (ticks - 1)
        mn = min_max_range[dimension][0]
        for i in range(ticks):  # range replaces the Python 2 xrange
            v = mn + i*step
            labels.append('%4.2f' % v)
        axx.set_yticklabels(labels)
    # Move the final axis' ticks to the right-hand side
    axx = plt.twinx(axes[-1])
    dimension += 1
    axx.xaxis.set_major_locator(ticker.FixedLocator([x[-2], x[-1]]))
    ticks = len(axx.get_yticklabels())
    step = min_max_range[dimension][2] / (ticks - 1)
    mn = min_max_range[dimension][0]
    labels = ['%4.2f' % (mn + i*step) for i in range(ticks)]
    axx.set_yticklabels(labels)
    # Stack the subplots
    plt.subplots_adjust(wspace=0)
    return plt
if __name__ == '__main__':
    import random
    base = [0, 0, 5, 5, 0]
    scale = [1.5, 2., 1.0, 2., 2.]
    data = [[base[x] + random.uniform(0., 1.)*scale[x]
             for x in range(5)] for y in range(30)]
    colors = ['r'] * 30
    base = [3, 6, 0, 1, 3]
    scale = [1.5, 2., 2.5, 2., 2.]
    data.extend([[base[x] + random.uniform(0., 1.)*scale[x]
                  for x in range(5)] for y in range(30)])
    colors.extend(['b'] * 30)
    parallel_coordinates(data, style=colors).show()
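The core trick described above (normalize each axis to 0..1, then relabel the ticks with the original values) can be isolated from the plotting. The following is only a sketch of that idea in plain Python; the names normalize_axes and tick_label are made up for illustration and are not part of the function above.

```python
def normalize_axes(data_sets):
    """Scale every dimension of `data_sets` to [0, 1] independently.

    Returns the normalized rows plus one (min, max) pair per dimension,
    which is what the tick labels need to show the original values.
    """
    limits = []
    for column in zip(*data_sets):
        lo, hi = min(column), max(column)
        if lo == hi:  # constant column: widen the range to avoid dividing by zero
            lo, hi = lo - 0.5, lo + 0.5
        limits.append((lo, hi))
    normalized = [
        [(value - lo) / (hi - lo) for value, (lo, hi) in zip(row, limits)]
        for row in data_sets
    ]
    return normalized, limits


def tick_label(position, lo, hi):
    """Map a tick position on the 0..1 axis back to the original data value."""
    return "%4.2f" % (lo + position * (hi - lo))


# Made-up sample rows: two dimensions with very different ranges
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
normalized, limits = normalize_axes(rows)
print(normalized)
print(tick_label(0.5, *limits[1]))
```

Because every axis ends up on the same 0..1 scale, the lines can be drawn with ordinary plot calls, and only the labels carry the real units.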
Edit 2:
Here is an example of what comes out of the above code when plotting Fisher's Iris data. It isn't quite as nice as the reference image from Wikipedia, but it is passable if all you have is Matplotlib and you need multi-dimensional plots.
pandas has a parallel coordinates wrapper:
import pandas
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting no longer exists in current pandas
data = pandas.read_csv(r'C:\Python27\Lib\site-packages\pandas\tests\data\iris.csv', sep=',')
parallel_coordinates(data, 'Name')
plt.show()
Source code, how they made it: plotting.py#L494
When answering a related question, I worked out a version using only one subplot (so it can be easily fit together with other plots) and optionally using cubic bezier curves to connect the points. The plot adjusts itself to the desired number of axes.
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import numpy as np
fig, host = plt.subplots()
# create some dummy data
ynames = ['P1', 'P2', 'P3', 'P4', 'P5']
N1, N2, N3 = 10, 5, 8
N = N1 + N2 + N3
category = np.concatenate([np.full(N1, 1), np.full(N2, 2), np.full(N3, 3)])
y1 = np.random.uniform(0, 10, N) + 7 * category
y2 = np.sin(np.random.uniform(0, np.pi, N)) ** category
y3 = np.random.binomial(300, 1 - category / 10, N)
y4 = np.random.binomial(200, (category / 6) ** 1/3, N)
y5 = np.random.uniform(0, 800, N)
# organize the data
ys = np.dstack([y1, y2, y3, y4, y5])[0]
ymins = ys.min(axis=0)
ymaxs = ys.max(axis=0)
dys = ymaxs - ymins
ymins -= dys * 0.05 # add 5% padding below and above
ymaxs += dys * 0.05
dys = ymaxs - ymins
# transform all data to be compatible with the main axis
zs = np.zeros_like(ys)
zs[:, 0] = ys[:, 0]
zs[:, 1:] = (ys[:, 1:] - ymins[1:]) / dys[1:] * dys[0] + ymins[0]
axes = [host] + [host.twinx() for i in range(ys.shape[1] - 1)]
for i, ax in enumerate(axes):
    ax.set_ylim(ymins[i], ymaxs[i])
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    if ax != host:
        ax.spines['left'].set_visible(False)
        ax.yaxis.set_ticks_position('right')
        ax.spines["right"].set_position(("axes", i / (ys.shape[1] - 1)))
host.set_xlim(0, ys.shape[1] - 1)
host.set_xticks(range(ys.shape[1]))
host.set_xticklabels(ynames, fontsize=14)
host.tick_params(axis='x', which='major', pad=7)
host.spines['right'].set_visible(False)
host.xaxis.tick_top()
host.set_title('Parallel Coordinates Plot', fontsize=18)
colors = plt.cm.tab10.colors
for j in range(N):
    # to just draw straight lines between the axes:
    # host.plot(range(ys.shape[1]), zs[j,:], c=colors[(category[j] - 1) % len(colors) ])
    # create bezier curves
    # for each axis, there will be a control vertex at the point itself, one at 1/3rd towards the previous and one
    # at one third towards the next axis; the first and last axis have one less control vertex
    # x-coordinate of the control vertices: at each integer (for the axes) and two in between
    # y-coordinate: repeat every point three times, except the first and last only twice
    verts = list(zip([x for x in np.linspace(0, len(ys) - 1, len(ys) * 3 - 2, endpoint=True)],
                     np.repeat(zs[j, :], 3)[1:-1]))
    # for x,y in verts: host.plot(x, y, 'go') # to show the control points of the beziers
    codes = [Path.MOVETO] + [Path.CURVE4 for _ in range(len(verts) - 1)]
    path = Path(verts, codes)
    patch = patches.PathPatch(path, facecolor='none', lw=1, edgecolor=colors[category[j] - 1])
    host.add_patch(patch)
plt.tight_layout()
plt.show()
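To make the control-vertex layout from the comments above easier to follow, here is a small sketch in plain Python that builds the vertices for a single line. The helper name bezier_control_points is hypothetical; it assumes the axes sit at x = 0, 1, 2, ... as in the code above.

```python
def bezier_control_points(values):
    """Return (x, y) control vertices for one line through `values`.

    values[i] is the height at axis i. Each axis contributes a vertex on the
    axis itself plus one a third of the way towards each neighbour, all at the
    same height, so the curve leaves and enters every axis horizontally.
    """
    n_axes = len(values)
    n_points = 3 * n_axes - 2  # first and last axis have one vertex less
    xs = [i / 3 for i in range(n_points)]
    # repeat each value three times, then drop the first and the last copy
    ys = [v for v in values for _ in range(3)][1:-1]
    return list(zip(xs, ys))


verts = bezier_control_points([2.0, 5.0, 3.0])
for x, y in verts:
    print(round(x, 3), y)
```

With three axes this gives seven vertices: one MOVETO followed by six CURVE4 codes, that is, two cubic segments of three vertices each.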
Here's similar code for the iris data set. The second axis is reversed to avoid some crossing lines.
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
ynames = iris.feature_names
ys = iris.data
ymins = ys.min(axis=0)
ymaxs = ys.max(axis=0)
dys = ymaxs - ymins
ymins -= dys * 0.05 # add 5% padding below and above
ymaxs += dys * 0.05
ymaxs[1], ymins[1] = ymins[1], ymaxs[1] # reverse axis 1 to have less crossings
dys = ymaxs - ymins
# transform all data to be compatible with the main axis
zs = np.zeros_like(ys)
zs[:, 0] = ys[:, 0]
zs[:, 1:] = (ys[:, 1:] - ymins[1:]) / dys[1:] * dys[0] + ymins[0]
fig, host = plt.subplots(figsize=(10,4))
axes = [host] + [host.twinx() for i in range(ys.shape[1] - 1)]
for i, ax in enumerate(axes):
    ax.set_ylim(ymins[i], ymaxs[i])
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    if ax != host:
        ax.spines['left'].set_visible(False)
        ax.yaxis.set_ticks_position('right')
        ax.spines["right"].set_position(("axes", i / (ys.shape[1] - 1)))
host.set_xlim(0, ys.shape[1] - 1)
host.set_xticks(range(ys.shape[1]))
host.set_xticklabels(ynames, fontsize=14)
host.tick_params(axis='x', which='major', pad=7)
host.spines['right'].set_visible(False)
host.xaxis.tick_top()
host.set_title('Parallel Coordinates Plot — Iris', fontsize=18, pad=12)
colors = plt.cm.Set2.colors
legend_handles = [None for _ in iris.target_names]
for j in range(ys.shape[0]):
    # create bezier curves
    verts = list(zip([x for x in np.linspace(0, len(ys) - 1, len(ys) * 3 - 2, endpoint=True)],
                     np.repeat(zs[j, :], 3)[1:-1]))
    codes = [Path.MOVETO] + [Path.CURVE4 for _ in range(len(verts) - 1)]
    path = Path(verts, codes)
    patch = patches.PathPatch(path, facecolor='none', lw=2, alpha=0.7, edgecolor=colors[iris.target[j]])
    legend_handles[iris.target[j]] = patch
    host.add_patch(patch)
host.legend(legend_handles, iris.target_names,
loc='lower center', bbox_to_anchor=(0.5, -0.18),
ncol=len(iris.target_names), fancybox=True, shadow=True)
plt.tight_layout()
plt.show()
I'm sure there is a better way of doing it, but here's a quick-and-dirty one (a really dirty one):
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
#vectors to plot: 4D for this example
y1=[1,2.3,8.0,2.5]
y2=[1.5,1.7,2.2,2.9]
x=[1,2,3,8] # spines
fig,(ax,ax2,ax3) = plt.subplots(1, 3, sharey=False)
# plot the same on all the subplots
ax.plot(x,y1,'r-', x,y2,'b-')
ax2.plot(x,y1,'r-', x,y2,'b-')
ax3.plot(x,y1,'r-', x,y2,'b-')
# now zoom in each of the subplots
ax.set_xlim([ x[0],x[1]])
ax2.set_xlim([ x[1],x[2]])
ax3.set_xlim([ x[2],x[3]])
# set the x axis ticks
for axx,xx in zip([ax,ax2,ax3],x[:-1]):
    axx.xaxis.set_major_locator(ticker.FixedLocator([xx]))
ax3.xaxis.set_major_locator(ticker.FixedLocator([x[-2],x[-1]])) # the last one
# EDIT: add the labels to the rightmost spine
for tick in ax3.yaxis.get_major_ticks():
    tick.label2.set_visible(True) # replaces tick.label2On, which newer Matplotlib no longer has
# stack the subplots together
plt.subplots_adjust(wspace=0)
plt.show()
This is essentially based on a (much nicer) one by Joe Kington, Python/Matplotlib - Is there a way to make a discontinuous axis?. You might also want to have a look at the other answer to the same question.
In this example I don't even attempt at scaling the vertical scales, since it depends on what exactly you are trying to achieve.
EDIT: Here is the result
When using pandas (like suggested by theta), there is no way to scale the axes independently.
The reason you can't find the different vertical axes is because there aren't any. Our parallel coordinates is "faking" the other two axes by just drawing a vertical line and some labels.
https://github.com/pydata/pandas/issues/7083#issuecomment-74253671
I've adapted @JohanC's code to a pandas dataframe and expanded it to also work with categorical variables. The code still needs improving, for example to allow a numerical variable as the first one in the dataframe, but I think it is nice for now.
# Paths:
path_data = "data/"
# Packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.path import Path
import matplotlib.patches as patches
from functools import reduce
# Display options:
pd.set_option("display.width", 1200)
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_rows", 300)
# Dataset:
df = pd.read_csv(path_data + "nasa_exoplanets.csv")
df_varnames = pd.read_csv(path_data + "nasa_exoplanets_var_names.csv")
# Variables (the first variable must be categoric):
my_vars = ["discoverymethod", "pl_orbper", "st_teff", "disc_locale", "sy_gaiamag"]
# pd.concat replaces reduce(pd.DataFrame.append, ...); DataFrame.append was removed in pandas 2.0
my_vars_names = pd.concat([df_varnames[df_varnames["var"] == i] for i in my_vars])
my_vars_names = my_vars_names["var_name"].values.tolist()
# Adapt the data:
df = df.loc[df["pl_letter"] == "d"]
df_plot = df[my_vars]
df_plot = df_plot.dropna()
df_plot = df_plot.reset_index(drop = True)
# Convert to numeric matrix:
ym = []
dics_vars = []
for v, var in enumerate(my_vars):
    if df_plot[var].dtype.kind not in ["i", "u", "f"]:
        dic_var = dict([(val, c) for c, val in enumerate(df_plot[var].unique())])
        dics_vars += [dic_var]
        ym += [[dic_var[i] for i in df_plot[var].tolist()]]
    else:
        ym += [df_plot[var].tolist()]
ym = np.array(ym).T
# Padding:
ymins = ym.min(axis = 0)
ymaxs = ym.max(axis = 0)
dys = ymaxs - ymins
ymins -= dys*0.05
ymaxs += dys*0.05
# Reverse some axes for better visual:
axes_to_reverse = [0, 1]
for a in axes_to_reverse:
    ymaxs[a], ymins[a] = ymins[a], ymaxs[a]
dys = ymaxs - ymins
# Adjust to the main axis:
zs = np.zeros_like(ym)
zs[:, 0] = ym[:, 0]
zs[:, 1:] = (ym[:, 1:] - ymins[1:])/dys[1:]*dys[0] + ymins[0]
# Colors:
n_levels = len(dics_vars[0])
my_colors = ["#F41E1E", "#F4951E", "#F4F01E", "#4EF41E", "#1EF4DC", "#1E3CF4", "#F41EF3"]
cmap = LinearSegmentedColormap.from_list("my_palette", my_colors)
my_palette = [cmap(i/n_levels) for i in np.array(range(n_levels))]
# Plot:
fig, host_ax = plt.subplots(
figsize = (20, 10),
tight_layout = True
)
# Make the axes:
axes = [host_ax] + [host_ax.twinx() for i in range(ym.shape[1] - 1)]
dic_count = 0
for i, ax in enumerate(axes):
    ax.set_ylim(
        bottom = ymins[i],
        top = ymaxs[i]
    )
    ax.spines.top.set_visible(False)
    ax.spines.bottom.set_visible(False)
    ax.ticklabel_format(style = 'plain')
    if ax != host_ax:
        ax.spines.left.set_visible(False)
        ax.yaxis.set_ticks_position("right")
        ax.spines.right.set_position(
            (
                "axes",
                i/(ym.shape[1] - 1)
            )
        )
    if df_plot.iloc[:, i].dtype.kind not in ["i", "u", "f"]:
        dic_var_i = dics_vars[dic_count]
        ax.set_yticks(
            range(len(dic_var_i))
        )
        ax.set_yticklabels(
            [key_val for key_val in dics_vars[dic_count].keys()]
        )
        dic_count += 1
host_ax.set_xlim(
left = 0,
right = ym.shape[1] - 1
)
host_ax.set_xticks(
range(ym.shape[1])
)
host_ax.set_xticklabels(
my_vars_names,
fontsize = 14
)
host_ax.tick_params(
axis = "x",
which = "major",
pad = 7
)
# Make the curves:
host_ax.spines.right.set_visible(False)
host_ax.xaxis.tick_top()
for j in range(ym.shape[0]):
    verts = list(zip([x for x in np.linspace(0, len(ym) - 1, len(ym)*3 - 2,
                                             endpoint = True)],
                     np.repeat(zs[j, :], 3)[1: -1]))
    codes = [Path.MOVETO] + [Path.CURVE4 for _ in range(len(verts) - 1)]
    path = Path(verts, codes)
    color_first_cat_var = my_palette[dics_vars[0][df_plot.iloc[j, 0]]]
    patch = patches.PathPatch(
        path,
        facecolor = "none",
        lw = 2,
        alpha = 0.7,
        edgecolor = color_first_cat_var
    )
    host_ax.add_patch(patch)
plotly has a nice interactive solution called parallel_coordinates which works just fine:
import plotly.express as px
df = px.data.iris()
fig = px.parallel_coordinates(df, color="species_id", labels={"species_id": "Species",
"sepal_width": "Sepal Width", "sepal_length": "Sepal Length",
"petal_width": "Petal Width", "petal_length": "Petal Length", },
color_continuous_scale=px.colors.diverging.Tealrose,
color_continuous_midpoint=2)
fig.show()
I want to plug a beta-released parallel coordinate plotting package called Paxplot which is based on Matplotlib. It uses similar underlying logic to the other answers and extends functionality while maintaining clean usage.
The documentation provides examples of basic usage, advanced usage, and usage with Pandas. As per the figure provided in the original question, I have provided a solution that plots the iris dataset:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
import paxplot
# Import data
iris = load_iris(as_frame=True)
df = pd.DataFrame(
data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names'] + ['target']
)
cols = df.columns
# Create figure
paxfig = paxplot.pax_parallel(n_axes=len(cols))
paxfig.plot(df.to_numpy())
# Add labels
paxfig.set_labels(cols)
# Set ticks
paxfig.set_ticks(
ax_idx=-1,
ticks=[0, 1, 2],
labels=iris.target_names
)
# Add colorbar
color_col = 0
paxfig.add_colorbar(
ax_idx=color_col,
cmap='viridis',
colorbar_kwargs={'label': cols[color_col]}
)
plt.show()
For full disclosure, I created Paxplot and have been developing and maintaining it with some friends. Definitely feel free to reach out if you are interested in contributing!
Best example I've seen thus far is this one
https://python.g-node.org/python-summerschool-2013/_media/wiki/datavis/olympics_vis.py
See the normalised_coordinates function. Not super fast, but works from what I've tried.
normalised_coordinates(['VAL_1', 'VAL_2', 'VAL_3'], np.array([[1230.23, 1500000, 12453.03], [930.23, 140000, 12453.03], [130.23, 120000, 1243.03]]), [1, 2, 1])
Still far from perfect but it works and is relatively short:
import numpy as np
import matplotlib.pyplot as plt
def plot_parallel(data,labels):
    data=np.array(data)
    x=list(range(len(data[0])))
    fig, axis = plt.subplots(1, len(data[0])-1, sharey=False)
    for d in data:
        for i, a in enumerate(axis):
            temp=d[i:i+2].copy()
            temp[1]=(temp[1]-np.min(data[:,i+1]))*(np.max(data[:,i])-np.min(data[:,i]))/(np.max(data[:,i+1])-np.min(data[:,i+1]))+np.min(data[:,i])
            a.plot(x[i:i+2], temp)
    for i, a in enumerate(axis):
        a.set_xlim([x[i], x[i+1]])
        a.set_xticks([x[i], x[i+1]])
        a.set_xticklabels([labels[i], labels[i+1]], minor=False, rotation=45)
        a.set_ylim([np.min(data[:,i]),np.max(data[:,i])])
    plt.subplots_adjust(wspace=0)
    plt.show()
This is a version using TensorBoard, for cases where a Matplotlib figure is not strictly needed.
I was looking for something that works like the result shown in "Visualize the results in TensorBoard's HParams plugin". Here is a wrapper function that only does the plotting with TensorBoard and skips the training part of that tutorial. The logic is to use the column named by metrics_name as the metric and all other columns as HParams. For any other details, refer to the original tutorial.
import os
import json
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp
def tensorboard_parallel_coordinates_plot(dataframe, metrics_name, metrics_display_name=None, skip_columns=[], log_dir='logs/hparam_tuning'):
    skip_columns = skip_columns + [metrics_name]
    to_hp_discrete = lambda column: hp.HParam(column, hp.Discrete(np.unique(dataframe[column].values).tolist()))
    hp_params_dict = {column: to_hp_discrete(column) for column in dataframe.columns if column not in skip_columns}
    if dataframe[metrics_name].values.dtype == 'object': # Not numeric
        metrics_map = {ii: id for id, ii in enumerate(np.unique(dataframe[metrics_name]))}
        description = json.dumps(metrics_map)
    else:
        metrics_map, description = None, None
    METRICS = metrics_name if metrics_display_name is None else metrics_display_name
    with tf.summary.create_file_writer(log_dir).as_default():
        metrics = [hp.Metric(METRICS, display_name=METRICS, description=description)]
        hp.hparams_config(hparams=list(hp_params_dict.values()), metrics=metrics)
    for id in dataframe.index:
        log = dataframe.iloc[id]
        hparams = {hp_unit: log[column] for column, hp_unit in hp_params_dict.items()}
        print({hp_unit.name: hparams[hp_unit] for hp_unit in hparams})
        run_dir = os.path.join(log_dir, 'run-%d' % id)
        with tf.summary.create_file_writer(run_dir).as_default():
            hp.hparams(hparams) # record the values used in this trial
            metric_item = log[metrics_name] if metrics_map is None else metrics_map[log[metrics_name]]
            tf.summary.scalar(METRICS, metric_item, step=1)
    print()
    if metrics_map is not None:
        print("metrics_map:", metrics_map)
    print("Start tensorboard by: tensorboard --logdir {}".format(log_dir))
Plotting test:
aa = pd.read_csv('https://raw.github.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/iris.csv')
tensorboard_parallel_coordinates_plot(aa, metrics_name="Name", log_dir="logs/iris")
# metrics_map: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
# Start tensorboard by: tensorboard --logdir logs/iris
!tensorboard --logdir logs/iris
# TensorBoard 2.8.0 at http://localhost:6006/ (Press CTRL+C to quit)
Open the TensorBoard link (default http://localhost:6006/) and go to HPARAMS -> PARALLEL COORDINATES VIEW to see the result:
The TensorBoard result is interactive. However, it is designed for plotting model hyperparameter tuning results, so I don't think it is well suited to plotting large datasets.
You have to clean up the saved data manually before plotting new data into the same log_dir directory.
It seems the final metrics item has to be numeric, while the other axes don't have to be.
fake_data = {
"optimizer": ["sgd", "adam", "adam", "lamb", "lamb", "lamb", "lamb"],
"weight_decay": [0.1, 0.1, 0.2, 0.1, 0.2, 0.2, 0.3],
"rescale_mode": ["tf", "tf", "tf", "tf", "tf", "torch", "torch"],
"accuracy": [78.5, 78.2, 78.8, 79.2, 79.3, 79.5, 79.6],
}
aa = pd.DataFrame(fake_data)
tensorboard_parallel_coordinates_plot(aa, "accuracy", log_dir="logs/fake")
# Start tensorboard by: tensorboard --logdir logs/fake
!tensorboard --logdir logs/fake
# TensorBoard 2.8.0 at http://localhost:6006/ (Press CTRL+C to quit)
Related
The person who made this used dates in the second graph. I was wondering how dates could be used with the scipy.signal.argrelextrema function.
With this code nothing useful happens; it prints an empty array for peak_x and peak_y:
data_y = np.array('2015-07-04', dtype=np.datetime64) + np.arange(25)
Here's the link for the original code:
https://openwritings.net/pg/python/python-find-peaks-and-valleys-chart-using-scipysignalargrelextrema
import matplotlib
matplotlib.use('Agg') # Bypass the need to install Tkinter GUI framework
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
# Generate random data.
data_x = np.arange(start = 0, stop = 25, step = 1, dtype='int')
data_y = np.array('2015-07-04', dtype=np.datetime64) + np.arange(25) #edited part
# Find peaks(max).
peak_indexes = signal.argrelextrema(data_y, np.greater)
peak_indexes = peak_indexes[0]
# Find valleys(min).
valley_indexes = signal.argrelextrema(data_y, np.less)
valley_indexes = valley_indexes[0]
# Plot main graph.
(fig, ax) = plt.subplots()
ax.plot(data_x, data_y)
# Plot peaks.
peak_x = peak_indexes
peak_y = data_y[peak_indexes]
ax.plot(peak_x, peak_y, marker='o', linestyle='dashed', color='green', label="Peaks")
print(peak_x,peak_y)
# Plot valleys.
valley_x = valley_indexes
valley_y = data_y[valley_indexes]
ax.plot(valley_x, valley_y, marker='o', linestyle='dashed', color='red', label="Valleys")
# Save graph to file.
plt.title('Find peaks and valleys using argrelextrema()')
plt.legend(loc='best')
plt.savefig('argrelextrema.png')
Here's the example how it would work:
You're going to want to use the xticks method. See below:
import matplotlib.pyplot as plt
names = [str(i) for i in range(20)]
x_data = [x for x in range(20)]
y_data = [x for x in range(20)]
plt.plot(x_data, y_data)
plt.xticks(x_data, labels=names)
plt.show()
What this does is use the integers 0-19, cast to strings, as the labels for the X axis.
In your case, you want to swap out the names for datetime objects cast to strings. For xticks, the x_data argument prescribes where the ticks will be. You may use any interval of points so long as they are within the bounds of the x data.
In your case, replace:
data_y = np.array('2015-07-04', dtype=np.datetime64) + np.arange(25)
with
data_y_ticks = np.array('2015-07-04', dtype=np.datetime64) + np.arange(25)
data_y = [i for i, _ in enumerate(data_y_ticks.tolist())]
then plot as follows:
plt.plot(data_y, data_x)
plt.xticks(data_y, labels=data_y_ticks)
plt.show()
Just a heads-up, your X and Y axis names are flipped in your code. I did not correct this in my example, however did interchange their locations in the plot to make the plot make sense.
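On the empty peak_x and peak_y: an array of consecutive dates only ever increases, so it has no interior peaks or valleys to find. A sketch of one way to arrange things, assuming the dates belong on the x axis and the quantity being searched is numeric, is shown below. The values are made up, and the neighbour comparison is a NumPy-only stand-in that should find the same strict maxima as signal.argrelextrema(values, np.greater) with its default settings.

```python
import numpy as np

# Dates go on the x axis; the values whose peaks we want must be numeric.
dates = np.array("2015-07-04", dtype=np.datetime64) + np.arange(7)
values = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 4.5, 0.5])  # made-up sample data

# Strict local maxima: larger than both neighbours (endpoints are never peaks).
interior = values[1:-1]
is_peak = (interior > values[:-2]) & (interior > values[2:])
peak_indexes = np.flatnonzero(is_peak) + 1

# Index the date array with the peak positions to get the x values to plot.
peak_dates = dates[peak_indexes]
peak_values = values[peak_indexes]
print(peak_dates, peak_values)
```

The pairs (peak_dates, peak_values) can then be passed to ax.plot in place of peak_x and peak_y.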
I want to rescale my (qualitative) x-axis so the two peaks (visible in the graph) correspond to their actual values (i.e. 500 keV and 1274 keV).
How can I do this?
import numpy as np
import matplotlib.pyplot as plt
def read_from_file(filename):
    return np.loadtxt(filename)
data = list(read_from_file("calibration.txt"))
print(data.index(max(data[:2000])))#x value 500kev
print(data.index(max(data[2000:])))#x value 1274
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(len(data))
plt.plot(x, data)
plt.xlim(0, 5000)
plt.ylim(0, 7000)
plt.title("$^{22}$Na Spectrum")
plt.xlabel("Energy")
plt.ylabel("Amount of Photons")
plt.grid()
ax.annotate("500 keV", xy = (1450, 6541), xytext = (1600, 6500))
ax.annotate("1274 MeV", xy = (3500, 950), xytext = (3700, 1100))
plt.show()
Using numpy, you can find the index of the two spikes (i.e. no need to convert the data to a list) using argmax.
Then, you can scale the x values using:
xnew = val1 + (x - max1) / (max2 - max1) * (val2 - val1)
where val1 and val2 are the values of your peaks, and max1 and max2 are the indices of those peaks.
Here's a bit of code that should work:
import numpy as np
import matplotlib.pyplot as plt
# Fake some data approximately in your range. You can ignore this bit!
# Random numbers for noise
data = 1000. + np.random.rand(5000) * 100.
x = np.arange(len(data))
# Add the first spike
mu1, sd1 = 1450., 300.
pdf1 = (1./(sd1*2.*np.pi) * np.exp(-(x - mu1)**2 / sd1**2)) * 1e7
data += pdf1
# Add the second spike
mu2, sd2 = 3500., 200.
pdf2 = (1./(sd2*2.*np.pi) * np.exp(-(x - mu2)**2 / sd2**2)) * 1e6
data += pdf2
# End of fake data generation
# Find the index of the first maximum (using your '2000' cutoff)
cutoff = 2000
max1 = float(np.argmax(data[:cutoff]))
# Find the index of the second maximum (searching after the cutoff)
max2 = float(np.argmax(data[cutoff:]) + cutoff)
# The actual values of the two spikes
val1, val2 = 500., 1274
# Scale the xvalues
xnew = val1 + (x - max1) / (max2 - max1) * (val2 - val1)
# Plot
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(xnew, data)
ax.set_ylim(0, 7000)
ax.set_title("$^{22}$Na Spectrum")
ax.set_xlabel("Energy")
ax.set_ylabel("Number of Photons")
ax.grid()
# Add some lines at the actual spikes to check scaling worked
ax.axvline(val1)
ax.axvline(val2)
plt.show()
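The scaling formula is simply the straight line through the two calibration points (max1, val1) and (max2, val2). As a quick check of the arithmetic, here is a sketch that wraps it in a function; calibrate_axis is a made-up name and the peak indices are hypothetical round numbers chosen to keep the numbers easy to follow.

```python
import numpy as np


def calibrate_axis(n_samples, idx1, idx2, val1, val2):
    """Map sample indices 0..n_samples-1 linearly so idx1 -> val1 and idx2 -> val2."""
    x = np.arange(n_samples, dtype=float)
    return val1 + (x - idx1) / (idx2 - idx1) * (val2 - val1)


# Hypothetical peak positions, not taken from the real calibration file
energies = calibrate_axis(n_samples=5000, idx1=1500, idx2=3500, val1=500.0, val2=1274.0)
print(energies[1500], energies[3500])  # the two calibration points
print(energies[1] - energies[0])       # energy step per channel
```

Every channel between and beyond the two peaks is placed on the same line, so the spacing per channel is constant: (val2 - val1) / (idx2 - idx1).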
Funny you should ask this question. I am currently trying to push an example into Matplotlib that shows exactly how to do this. You can view the recipe here: https://github.com/madphysicist/matplotlib/blob/7b05223c85741120019b81e1248c20f9bc090c61/examples/ticks_and_spines/tick_transform_formatter.py
You do not need the entire code in the example (or the tick formatter that uses it), but the mapping function will help you create the scaled x-array (also, use np.argmax instead of index(max(...))):
ind500 = np.argmax(data[:2000])
ind1274 = np.argmax(data[2000:]) + 2000
x_scaled = (x - ind500) * (1274 - 500) / (ind1274 - ind500) + 500
You can use x_scaled to plot as usual:
plt.plot(x_scaled, data)
...
Combining it all together (and making a couple of tweaks to use OO API instead of pyplot):
import numpy as np
from matplotlib import pyplot as plt
data = np.loadtxt("calibration.txt") # Don't convert this back to a list
ind500 = np.argmax(data[:2000])
ind1274 = np.argmax(data[2000:]) + 2000
x = (np.arange(len(data)) - ind500) * (1274 - 500) / (ind1274 - ind500) + 500
fig, ax = plt.subplots()
ax.plot(x, data)
plt.title("$^{22}$Na Spectrum")
plt.xlabel("Energy")
plt.ylabel("Photons Counts")
plt.grid()
ax.annotate("500 keV", xy = (500, data[ind500]), xytext = (550, data[ind500] + 100))
ax.annotate("1274 keV", xy = (1274, data[ind1274]), xytext = (1324, data[ind1274] + 100))
plt.show()
The example I linked to would allow you to display the x-axis in entirely different units without actually modifying your x-array.
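For readers who want the general idea without the linked recipe, the following is a rough sketch using Matplotlib's standard FuncFormatter. It is not the code from the linked example; channel_to_energy and energy_formatter are hypothetical names, and the peak indices are placeholders.

```python
def channel_to_energy(channel, idx1, idx2, val1, val2):
    """Linear map from raw channel number to calibrated energy."""
    return val1 + (channel - idx1) * (val2 - val1) / (idx2 - idx1)


def energy_formatter(idx1, idx2, val1, val2):
    """Return a tick formatter that labels channel positions with energies."""
    from matplotlib.ticker import FuncFormatter  # imported lazily; only needed for plotting

    # FuncFormatter calls the function with (tick_value, tick_position)
    return FuncFormatter(
        lambda channel, pos: "%.0f" % channel_to_energy(channel, idx1, idx2, val1, val2)
    )


# Placeholder peak indices, not taken from the real data
print(channel_to_energy(1500, 1500, 3500, 500.0, 1274.0))
print(channel_to_energy(3500, 1500, 3500, 500.0, 1274.0))

# Usage on an existing Axes `ax` whose x data are raw channel numbers:
#     ax.xaxis.set_major_formatter(energy_formatter(1500, 3500, 500.0, 1274.0))
```

With this approach the plotted x values stay as channel numbers and only the tick labels change, so the ticks still sit at round channel positions rather than at round energies.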
I've made a histogram of some data. It represents a bunch of stars, some of which have been observed, others of which have not. The white space represents the ones not observed. I would like to make it so that above the bins where there is white space, it would label the amount of white counts in those bins, but ignore the green and orange. If above the bins is complex, then below them is fine as well. Wherever is simplest. Here's the plot and the code:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import font_manager
# hfont = {'fontname':'Computer Modern'}
filename = ("master.kkids.obslist")
datadata = np.loadtxt(filename,skiprows=14,dtype=str)
x = datadata[:,7]
y = datadata[:,8]
u = datadata[:,18]
v = datadata[:,19]
z = datadata[:,20]
yesGEM = (v == "Y") & (u == "GemN")
yesDCT = (v == "Y") & (u == "DCT")
no = (v == "N")
x1 = x.astype(float)
y1 = y.astype(float)
x2 = x1*24/360
colors = ['green', 'orange', 'white']
labels = ['GemN','DCT','Not Observed']
plt.xlim(0,24)
plt.hist((x2[yesGEM],x2[yesDCT],x2[no]), 24, label=labels, color=colors, histtype='bar', stacked=True)
plt.legend(fancybox=True,shadow=True)
plt.ylabel('Frequency')
plt.xlabel('Right Ascension (h)')
plt.savefig('kkids.hist.png')
The general pattern can be to loop on your x axis, for each stack check if there is a 'Not Observed', if so, fetch the values of GemN + DCT + (Not Observed / 2) for this observation, then use these two values as x and y to plot the desired text (i.e the value of not observed) using plt.text(x, y, text).
So if I understand your dataset correctly, I guess something like this should do the job:
# Get the returned arrays from plt.hist, they contain stacked frequencies
phist = plt.hist((x2[yesGEM],x2[yesDCT],x2[no]), 24, label=labels,
color=colors, histtype='bar', stacked=True)
plt.legend(fancybox=True,shadow=True)
plt.ylabel('Frequency')
plt.xlabel('Right Ascension (h)')
# Reshape the frequencies values :
stack_plots = \
np.concatenate(phist[0]).reshape(len(labels), len(phist[0][0])).T
for nb_x, stack in enumerate(stack_plots):
    # 'stack' is an array with the frequency of [GEM, DCT + GEM, DCT + GEM + NO]
    if stack[2] - stack[1] != 0: # If there is Not Observed..
        # Compute the coords for the text :
        x_text = nb_x
        y_text = stack[1] + (stack[2] - stack[1]) / 2
        # And plot it:
        plt.text(x_text, y_text, round(stack[2] - stack[1]), verticalalignment='center')
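If the labels should sit above the bars, centred on each bin, the positions can also be computed separately with np.histogram. The sketch below assumes 24 one-hour bins over 0 to 24 h and, for brevity, merges the two observed groups into one list; unobserved_labels is a made-up helper name and the right ascensions are invented.

```python
import numpy as np


def unobserved_labels(observed, unobserved, bins=24, value_range=(0, 24)):
    """Return (x, y, count) for every bin that contains unobserved stars.

    x is the bin centre and y the top of the full stack (observed + unobserved),
    i.e. where a label placed above the bin would go.
    """
    obs_counts, edges = np.histogram(observed, bins=bins, range=value_range)
    un_counts, _ = np.histogram(unobserved, bins=bins, range=value_range)
    centres = (edges[:-1] + edges[1:]) / 2
    return [
        (float(c), int(o + u), int(u))
        for c, o, u in zip(centres, obs_counts, un_counts)
        if u > 0
    ]


# Made-up right ascensions in hours
labels = unobserved_labels(observed=[0.2, 0.7, 5.5], unobserved=[0.4, 5.1, 5.9, 23.5])
for x, y, count in labels:
    # with Matplotlib: plt.text(x, y, str(count), ha="center", va="bottom")
    print(x, y, count)
```

Using the bin centre for x keeps the label aligned with its bar even when the bin width is not exactly one unit.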
scikit-learn has a very nice demo that creates an outlier analysis tool. Here is the code:
import numpy as np
import pylab as pl
import matplotlib.font_manager
from scipy import stats
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# define two outlier detection tools to be compared
classifiers = {
"One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
kernel="rbf", gamma=0.1),
"robust covariance estimator": EllipticEnvelope(contamination=.1)}
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = 0
# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation (randn needs integer sizes, hence the integer division)
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
    # Fit the model with the One-Class SVM
    pl.figure(figsize=(10, 5))
    for i, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        clf.fit(X)
        y_pred = clf.decision_function(X).ravel()
        threshold = stats.scoreatpercentile(y_pred,
                                            100 * outliers_fraction)
        y_pred = y_pred > threshold
        n_errors = (y_pred != ground_truth).sum()
        # plot the levels lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = pl.subplot(1, 2, i + 1)
        subplot.set_title("Outlier detection")
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=pl.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
        subplot.axis('tight')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=11))
        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    pl.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
pl.show()
And here is what it looks like:
Is that cool or what?
However, I want the plot to be mouse-sensitive. That is, I want to be able to click on dots and find out what they are, with either a tool-tip or with a pop-up window, or something in a scroller. And I'd also like to be able to click-to-zoom, rather than zoom with a bounding box.
Is there any way to do this?
Not to plug my own project too much, but have a look at mpldatacursor. If you'd prefer, it's also quite easy to implement from scratch.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from mpldatacursor import datacursor
x1, y1 = np.random.random((2, 5))
x2, y2 = np.random.random((2, 5))
fig, ax = plt.subplots()
ax.plot(x1, y1, 'ro', markersize=12, label='Series A')
ax.plot(x2, y2, 'bo', markersize=12, label='Series B')
ax.legend()
datacursor()
plt.show()
For this to work with the example code you posted, you'd need to change things slightly. As it is, the artist labels are set in the call to legend, instead of when the artist is created. This means that there's no way to retrieve what's displayed in the legend for a particular artist. All you'd need to do is just pass in the labels as a kwarg to scatter instead of as the second argument to legend, and things should work as you were wanting.
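For the "implement from scratch" route, the core of a click handler is finding which plotted point lies closest to the click. The following is only a minimal sketch of that lookup, not how mpldatacursor works internally; the name nearest_point and the max_dist parameter are made up for illustration, and distances are measured in data units, which ignores the aspect ratio of the axes.

```python
from math import hypot


def nearest_point(points, x, y, max_dist=None):
    """Return the index of the point closest to (x, y), or None.

    points   -- sequence of (px, py) pairs in data coordinates
    max_dist -- optional cut-off; clicks farther away than this match nothing
    """
    best_i, best_d = None, None
    for i, (px, py) in enumerate(points):
        d = hypot(px - x, py - y)  # Euclidean distance in data units
        if best_d is None or d < best_d:
            best_i, best_d = i, d
    if best_i is None or (max_dist is not None and best_d > max_dist):
        return None
    return best_i


# Hypothetical data: three dots and a click near the second one.
dots = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
print(nearest_point(dots, 0.9, 1.2))
```

In a Matplotlib figure, a callback registered with fig.canvas.mpl_connect('button_press_event', callback) receives an event whose xdata and ydata attributes hold the click position in data coordinates (both are None when the click falls outside the axes), so those two values could be passed straight to a function like this one.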
In Matplotlib, it's not too tough to make a legend (example_legend(), below), but I think it's better style to put labels right on the curves being plotted (as in example_inline(), below). This can be very fiddly, because I have to specify coordinates by hand, and, if I re-format the plot, I probably have to reposition the labels. Is there a way to automatically generate labels on curves in Matplotlib? Bonus points for being able to orient the text at an angle corresponding to the angle of the curve.
import numpy as np
import matplotlib.pyplot as plt
def example_legend():
    plt.clf()
    x = np.linspace(0, 1, 101)
    y1 = np.sin(x * np.pi / 2)
    y2 = np.cos(x * np.pi / 2)
    plt.plot(x, y1, label='sin')
    plt.plot(x, y2, label='cos')
    plt.legend()
def example_inline():
    plt.clf()
    x = np.linspace(0, 1, 101)
    y1 = np.sin(x * np.pi / 2)
    y2 = np.cos(x * np.pi / 2)
    plt.plot(x, y1, label='sin')
    plt.plot(x, y2, label='cos')
    plt.text(0.08, 0.2, 'sin')
    plt.text(0.9, 0.2, 'cos')
Update: User cphyc has kindly created a GitHub repository for the code in this answer (see here), and bundled the code into a package which may be installed using pip install matplotlib-label-lines.
Pretty Picture:
In matplotlib it's pretty easy to label contour plots (either automatically or by manually placing labels with mouse clicks). There does not (yet) appear to be any equivalent capability to label data series in this fashion! There may be some semantic reason for not including this feature which I am missing.
Regardless, I have written the following module which allows for semi-automatic plot labelling. It requires only numpy and a couple of functions from the standard math library.
Description
The default behaviour of the labelLines function is to space the labels evenly along the x axis (automatically placing at the correct y-value of course). If you want you can just pass an array of the x co-ordinates of each of the labels. You can even tweak the location of one label (as shown in the bottom right plot) and space the rest evenly if you like.
In addition, the labelLines function ignores lines which have not had a label assigned in the plot command (or, more accurately, lines whose label contains '_line').
Keyword arguments passed to labelLines or labelLine are passed on to the text function call (some keyword arguments are set if the calling code chooses not to specify).
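The default spacing described above can be illustrated without Matplotlib. This is a small sketch of the idea only: the function name default_label_xvals is hypothetical, and it simply splits the x-axis range into equal steps and drops the two end points, so no label sits on the axis border.

```python
def default_label_xvals(xmin, xmax, n_labels):
    """Evenly spaced x positions strictly inside (xmin, xmax)."""
    step = (xmax - xmin) / (n_labels + 1)
    # k runs from 1 to n_labels, so both axis limits are skipped.
    return [xmin + step * k for k in range(1, n_labels + 1)]


# Three labelled lines on an axis running from 0 to 1.
print(default_label_xvals(0.0, 1.0, 3))  # [0.25, 0.5, 0.75]
```

A list like this can be edited by hand (for example, moving just one entry) and then passed as xvals to override the default placement.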
Issues
Annotation bounding boxes sometimes interfere undesirably with other curves, as shown by the 1 and 10 annotations in the top left plot. I'm not even sure this can be avoided.
It would be nice to specify a y position instead sometimes.
It's still an iterative process to get annotations in the right location.
It only works when the x-axis values are floats.
Gotchas
By default, the labelLines function assumes that all data series span the range specified by the axis limits. Take a look at the blue curve in the top left plot of the pretty picture. If there were only data available for the x range 0.5-1, then we couldn't possibly place a label at the desired location (which is a little less than 0.2). See this question for a particularly nasty example. Right now, the code does not intelligently identify this scenario and re-arrange the labels; however, there is a reasonable workaround. The labelLines function takes the xvals argument: a list of x-values specified by the user instead of the default linear distribution across the width. So the user can decide which x-values to use for the label placement of each data series.
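One way to build such an xvals list is sketched below. It is an assumption-laden illustration rather than part of the module: the helper xvals_within_data is hypothetical, and it expects the caller to supply the (first x, last x) data range of each labelled line, in the same order as the lines. It keeps the default evenly spaced position when the line has data there, and otherwise falls back to the middle of that line's data range.

```python
def xvals_within_data(data_ranges, xmin, xmax):
    """Pick one label x-position per line, inside that line's data range.

    data_ranges -- list of (lo, hi) tuples, one per labelled line
    xmin, xmax  -- the axis limits
    """
    n = len(data_ranges)
    xvals = []
    for k, (lo, hi) in enumerate(data_ranges, start=1):
        # Default position: evenly spaced across the axis width.
        x = xmin + (xmax - xmin) * k / (n + 1)
        if not (lo <= x <= hi):
            # No data at the default spot: use the middle of the data instead.
            x = (lo + hi) / 2
        xvals.append(x)
    return xvals


# First line only has data on 0.5-1, second line spans the whole axis.
print(xvals_within_data([(0.5, 1.0), (0.0, 1.0)], 0.0, 1.0))
```

The result could then be handed to labelLines through its xvals argument. Labels chosen this way may still collide with each other, so some manual adjustment can remain necessary.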
Also, I believe this is the first answer to complete the bonus objective of aligning the labels with the curve they're on. :)
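The reason alignment needs a coordinate transform is that a slope in data units is generally not the slope you see on screen: the two axes are usually scaled differently. The sketch below shows the arithmetic for linear axes only; screen_angle and its pixel-size parameters are hypothetical names used for illustration, whereas the module relies on the axes' own transData.transform_angles to do this job.

```python
from math import atan2, degrees


def screen_angle(dx, dy, xlim, ylim, width_px, height_px):
    """Angle (degrees) at which a data-space segment (dx, dy) appears on screen."""
    # Pixels per data unit along each axis (linear axes assumed).
    sx = width_px / (xlim[1] - xlim[0])
    sy = height_px / (ylim[1] - ylim[0])
    return degrees(atan2(dy * sy, dx * sx))


# A 45 degree line in data space looks much flatter when the
# y axis covers ten times the range of the x axis.
print(screen_angle(1.0, 1.0, (0, 1), (0, 1), 400, 400))
print(screen_angle(1.0, 1.0, (0, 1), (0, 10), 400, 400))
```

Because the result depends on the axis limits and the figure size, a label rotated this way can become misaligned if the plot is rescaled after the label has been placed.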
labellines.py:
from math import atan2,degrees
import numpy as np
#Label line with line2D label data
def labelLine(line,x,label=None,align=True,**kwargs):
    ax = line.axes
    xdata = line.get_xdata()
    ydata = line.get_ydata()
    if (x < xdata[0]) or (x > xdata[-1]):
        print('x label location is outside data range!')
        return
    #Find corresponding y co-ordinate and angle of the line
    #(default to the last segment so that x == xdata[-1] is handled)
    ip = len(xdata) - 1
    for i in range(len(xdata)):
        if x < xdata[i]:
            ip = i
            break
    y = ydata[ip-1] + (ydata[ip]-ydata[ip-1])*(x-xdata[ip-1])/(xdata[ip]-xdata[ip-1])
    if not label:
        label = line.get_label()
    if align:
        #Compute the slope
        dx = xdata[ip] - xdata[ip-1]
        dy = ydata[ip] - ydata[ip-1]
        ang = degrees(atan2(dy,dx))
        #Transform to screen co-ordinates
        pt = np.array([x,y]).reshape((1,2))
        trans_angle = ax.transData.transform_angles(np.array((ang,)),pt)[0]
    else:
        trans_angle = 0
    #Set a bunch of keyword arguments
    if 'color' not in kwargs:
        kwargs['color'] = line.get_color()
    if ('horizontalalignment' not in kwargs) and ('ha' not in kwargs):
        kwargs['ha'] = 'center'
    if ('verticalalignment' not in kwargs) and ('va' not in kwargs):
        kwargs['va'] = 'center'
    if 'backgroundcolor' not in kwargs:
        kwargs['backgroundcolor'] = ax.get_facecolor()
    if 'clip_on' not in kwargs:
        kwargs['clip_on'] = True
    if 'zorder' not in kwargs:
        kwargs['zorder'] = 2.5
    ax.text(x,y,label,rotation=trans_angle,**kwargs)
def labelLines(lines,align=True,xvals=None,**kwargs):
    ax = lines[0].axes
    labLines = []
    labels = []
    #Take only the lines which have labels other than the default ones
    for line in lines:
        label = line.get_label()
        if "_line" not in label:
            labLines.append(line)
            labels.append(label)
    if xvals is None:
        xmin,xmax = ax.get_xlim()
        xvals = np.linspace(xmin,xmax,len(labLines)+2)[1:-1]
    for line,x,label in zip(labLines,xvals,labels):
        labelLine(line,x,label,align,**kwargs)
Test code to generate the pretty picture above:
from matplotlib import pyplot as plt
from scipy.stats import loglaplace,chi2
from labellines import *
X = np.linspace(0,1,500)
A = [1,2,5,10,20]
funcs = [np.arctan,np.sin,loglaplace(4).pdf,chi2(5).pdf]
plt.subplot(221)
for a in A:
    plt.plot(X,np.arctan(a*X),label=str(a))
labelLines(plt.gca().get_lines(),zorder=2.5)
plt.subplot(222)
for a in A:
    plt.plot(X,np.sin(a*X),label=str(a))
labelLines(plt.gca().get_lines(),align=False,fontsize=14)
plt.subplot(223)
for a in A:
    plt.plot(X,loglaplace(4).pdf(a*X),label=str(a))
xvals = [0.8,0.55,0.22,0.104,0.045]
labelLines(plt.gca().get_lines(),align=False,xvals=xvals,color='k')
plt.subplot(224)
for a in A:
    plt.plot(X,chi2(5).pdf(a*X),label=str(a))
lines = plt.gca().get_lines()
l1=lines[-1]
labelLine(l1,0.6,label=r'$Re=${}'.format(l1.get_label()),ha='left',va='bottom',align = False)
labelLines(lines[:-1],align=False)
plt.show()
Jan Kuiken's answer is certainly well thought out and thorough, but there are some caveats:
it does not work in all cases
it requires a fair amount of extra code
it may vary considerably from one plot to the next
A much simpler approach is to annotate the last point of each plot. The point can also be circled, for emphasis. This can be accomplished with one extra line:
import matplotlib.pyplot as plt
for i, (x, y) in enumerate(samples):
    plt.plot(x, y)
    plt.text(x[-1], y[-1], f'sample {i}')
A variant would be to use the method matplotlib.axes.Axes.annotate.
Nice question. A while ago I experimented a bit with this, but I haven't used it a lot because it's still not bulletproof. I divided the plot area into a 32x32 grid and calculated a 'potential field' for the best position of a label for each line according to the following rules:
White space is a good place for a label
Label should be near corresponding line
Label should be away from the other lines
The code was something like this:
import matplotlib.pyplot as plt
import numpy as np
from scipy import ndimage
def my_legend(axis = None):
    if axis is None:
        axis = plt.gca()
    N = 32
    Nlines = len(axis.lines)
    print(Nlines)
    xmin, xmax = axis.get_xlim()
    ymin, ymax = axis.get_ylim()
    # the 'point of presence' matrix
    pop = np.zeros((Nlines, N, N), dtype=float)
    for l in range(Nlines):
        # get xy data and scale it to the NxN squares
        xy = axis.lines[l].get_xydata()
        xy = (xy - [xmin,ymin]) / ([xmax-xmin, ymax-ymin]) * N
        xy = xy.astype(np.int32)
        # mask stuff outside plot
        mask = (xy[:,0] >= 0) & (xy[:,0] < N) & (xy[:,1] >= 0) & (xy[:,1] < N)
        xy = xy[mask]
        # add to pop
        for p in xy:
            pop[l][tuple(p)] = 1.0
    # find whitespace, nice place for labels
    ws = 1.0 - (np.sum(pop, axis=0) > 0) * 1.0
    # don't use the borders
    ws[:,0] = 0
    ws[:,N-1] = 0
    ws[0,:] = 0
    ws[N-1,:] = 0
    # blur the pop's
    for l in range(Nlines):
        pop[l] = ndimage.gaussian_filter(pop[l], sigma=N/5)
    for l in range(Nlines):
        # positive weights for current line, negative weight for others....
        w = -0.3 * np.ones(Nlines, dtype=float)
        w[l] = 0.5
        # calculate a field
        p = ws + np.sum(w[:, np.newaxis, np.newaxis] * pop, axis=0)
        plt.figure()
        plt.imshow(p, interpolation='nearest')
        plt.title(axis.lines[l].get_label())
        pos = np.argmax(p) # note, argmax flattens the array first
        # integer division recovers the grid indices from the flat index
        best_x, best_y = divmod(pos, N)
        x = xmin + (xmax-xmin) * best_x / N
        y = ymin + (ymax-ymin) * best_y / N
        axis.text(x, y, axis.lines[l].get_label(),
                  horizontalalignment='center',
                  verticalalignment='center')
plt.close('all')
x = np.linspace(0, 1, 101)
y1 = np.sin(x * np.pi / 2)
y2 = np.cos(x * np.pi / 2)
y3 = x * x
plt.plot(x, y1, 'b', label='blue')
plt.plot(x, y2, 'r', label='red')
plt.plot(x, y3, 'g', label='green')
my_legend()
plt.show()
And the resulting plot:
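The final step of my_legend, turning the flat index returned by np.argmax back into a grid cell, is easy to get wrong because under Python 3 it needs integer rather than true division. Here is a pure-Python sketch of just that step; best_cell is a made-up helper name, and it works on a plain list of lists instead of a NumPy array (np.unravel_index does the same conversion for arrays).

```python
def best_cell(field):
    """Return (first_index, second_index) of the largest value in a 2-D list."""
    n_cols = len(field[0])
    # Row-major flattening, the same order np.argmax uses.
    flat = [value for row in field for value in row]
    pos = max(range(len(flat)), key=flat.__getitem__)
    # divmod gives (pos // n_cols, pos % n_cols) in one call.
    return divmod(pos, n_cols)


# Tiny 3x3 "potential field" with its maximum in row 1, column 2.
field = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 5.0],
    [2.0, 0.0, 0.0],
]
print(best_cell(field))  # (1, 2)
```

Each index is then scaled back to data coordinates by multiplying by the axis range and dividing by the grid size, as in the x and y assignments of the function above.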
matplotx (which I wrote) has line_labels() which plots the labels to the right of the lines. It's also smart enough to avoid overlaps when too many lines are concentrated in one spot. (See stargraph for examples.) It does that by solving a particular non-negative-least-squares problem on the target positions of the labels. Anyway, in many cases where there's no overlap to begin with, such as the example below, that's not even necessary.
import matplotlib.pyplot as plt
import matplotx
import numpy as np
# create data
rng = np.random.default_rng(0)
offsets = [1.0, 1.50, 1.60]
labels = ["no balancing", "CRV-27", "CRV-27*"]
x0 = np.linspace(0.0, 3.0, 100)
y = [offset * x0 / (x0 + 1) + 0.1 * rng.random(len(x0)) for offset in offsets]
# plot
with plt.style.context(matplotx.styles.dufte):
    for yy, label in zip(y, labels):
        plt.plot(x0, yy, label=label)
    plt.xlabel("distance [m]")
    matplotx.ylabel_top("voltage [V]") # move ylabel to the top, rotate
    matplotx.line_labels() # line labels to the right
    plt.show()
    # plt.savefig("out.png", bbox_inches="tight")
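To give a feel for what avoiding overlaps means for labels stacked at the right-hand edge, here is a deliberately simple stand-in. It is not the non-negative least-squares formulation mentioned above and it is not matplotx code: it is a greedy pass that walks the labels from bottom to top and pushes each one up just enough to keep a minimum gap. The names spread_labels and min_gap are hypothetical.

```python
def spread_labels(targets, min_gap):
    """Return label positions at least min_gap apart, in the input order.

    targets -- desired y positions (e.g. the last y value of each line)
    """
    # Visit labels from lowest to highest target, remembering original slots.
    order = sorted(range(len(targets)), key=targets.__getitem__)
    placed = [0.0] * len(targets)
    prev = None
    for idx in order:
        y = targets[idx]
        if prev is not None and y < prev + min_gap:
            y = prev + min_gap  # push up just enough to clear the previous label
        placed[idx] = y
        prev = y
    return placed


# Two lines ending close together, one far away.
print(spread_labels([1.0, 1.25, 3.0], 0.5))  # [1.0, 1.5, 3.0]
```

A greedy pass like this only ever moves labels upward, so a crowded cluster drifts away from its lines; a least-squares approach can instead distribute the displacement in both directions.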
A simpler approach, like the one Ioannis Filippidis uses:
import matplotlib.pyplot as plt
import numpy as np
# evenly sampled time at intervals of 0.1
tMin=-1 ;tMax=10
t = np.arange(tMin, tMax, 0.1)
# red dashes for 22*t, solid blue line for t**2
plt.plot(t, 22*t, 'r--', t, t**2, 'b')
factor=3/4 ;offset=20 # text position in view
textPosition=[(tMax+tMin)*factor,22*(tMax+tMin)*factor]
plt.text(textPosition[0],textPosition[1]+offset,'22 t',color='red',fontsize=20)
textPosition=[(tMax+tMin)*factor,((tMax+tMin)*factor)**2+20]
plt.text(textPosition[0],textPosition[1]+offset, 't^2', bbox=dict(facecolor='blue', alpha=0.5),fontsize=20)
plt.show()
code python 3 on sageCell