Plot outliers using matplotlib and seaborn


I have performed outlier detection on some entrance sensor data for a shopping mall. I want to create one plot for each entrance and highlight the observations that are outliers (marked by True in the outlier column of the dataframe).
Here is a small snippet of the data for two entrances and a time span of six days:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({"date": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   "mall": ["Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1"],
                   "entrance": ["West", "West", "West", "West", "West", "West", "East", "East", "East", "East", "East", "East"],
                   "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222],
                   "outlier": [False, False, False, False, False, False, False, False, False, False, True, False]})
In order to create several plots (there are twenty entrances in the full data), I have come across lmplot in seaborn.
sns.set_theme(style="darkgrid")
for i, group in df.groupby('entrance'):
    sns.lmplot(x="date", y="in", data=group, fit_reg=False, hue="entrance")
    # pseudo code
    # for the rows that have an outlier (outlier = True) create a red dot for that observation
plt.show()
There are two things I would like to accomplish here:
1. A line plot instead of a scatter plot. I have not been successful in using sns.lineplot to create separate plots for each entrance, as lmplot seems better suited to that.
2. For each entrance plot, I would like to show which observations are outliers, preferably as red dots. I have sketched this as pseudo code in my plotting attempt.

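seaborn.lmplot is a figure-level function that returns a FacetGrid, which I think is more difficult to use in this case. Plotting each group with the axes-level functions sns.lineplot and sns.scatterplot is simpler.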
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# group by a plain column name so that i is a string, not a 1-tuple
for i, group in df.groupby('entrance'):
    # plot all the values as a lineplot
    sns.lineplot(x="date", y="in", data=group)
    # select the rows where outlier is True and plot them as red dots
    data_t = group[group.outlier]
    sns.scatterplot(x="date", y="in", data=data_t, color='r')
    # add a title using the value from the groupby
    plt.title(f'Entrance: {i}')
    # show the plot here, not outside the loop
    plt.show()
Alternate option
This option will allow for setting the number of columns and rows of a figure
import math
# specify the number of columns to plot
ncols = 2
# determine the number of rows, even if there's an odd number of unique entrances
nrows = math.ceil(len(df.entrance.unique()) / ncols)
fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(16, 16))
# flatten the axes into a 1-D array, which is easier to index with idx
axes = axes.ravel()
for idx, (i, group) in enumerate(df.groupby('entrance')):
    # plot all the values as a lineplot
    sns.lineplot(x="date", y="in", data=group, ax=axes[idx])
    # select the rows where outlier is True and plot them as red dots
    data_t = group[group.outlier]
    sns.scatterplot(x="date", y="in", data=data_t, color='r', ax=axes[idx])
    axes[idx].set_title(f'Entrance: {i}')
plt.show()
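The grid approach can also be wrapped in a small helper. What follows is a minimal sketch rather than the answerer's code: it uses plain matplotlib instead of seaborn, the function name plot_entrances and the third entrance "North" are made up for illustration, and it additionally hides the axes that stay empty when the number of entrances is not a multiple of ncols.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def plot_entrances(df, ncols=2):
    """Draw one line plot per entrance and mark outliers in red (hypothetical helper)."""
    groups = list(df.groupby("entrance"))
    nrows = math.ceil(len(groups) / ncols)
    # squeeze=False always returns a 2-D array, even for a single row or column
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False,
                             figsize=(6 * ncols, 4 * nrows))
    axes = axes.ravel()
    for ax, (name, group) in zip(axes, groups):
        ax.plot(group["date"], group["in"])
        flagged = group[group["outlier"]]
        ax.scatter(flagged["date"], flagged["in"], color="red", zorder=3)
        ax.set_title(f"Entrance: {name}")
    # hide any axes left over in the last row
    for ax in axes[len(groups):]:
        ax.set_visible(False)
    return fig, axes


# made-up sample with three entrances, so one of the four axes stays unused
df = pd.DataFrame({
    "date": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "entrance": ["West"] * 3 + ["East"] * 3 + ["North"] * 3,
    "in": [132, 140, 163, 240, 2000, 222, 90, 95, 91],
    "outlier": [False, False, False, False, True, False, False, False, False],
})
fig, axes = plot_entrances(df, ncols=2)
```

Because groupby sorts its keys, the panels appear in alphabetical order of the entrance names.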

Related

Scatter plot to show the majorities and include extreme numbers

Simple data as below and I want to put them in a scatter plot.
It works well if there are no outliers (i.e. extremely big numbers).
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
dates = ["2021-01-01",
         "2021-01-01", "2021-01-06",
         "2021-01-08", "2021-01-12",
         "2021-02-01", "2021-02-11",
         "2021-02-12", "2021-02-15",
         "2021-02-16", "2021-03-11",
         "2021-03-21", "2021-03-22",
         "2021-03-23", "2021-03-24",
         "2021-04-02", "2021-04-12",
         "2021-04-22", "2021-04-26",
         "2021-04-30"]
numbers = [6400,
           5100, 5000,
           4000, 3686,
           9000, 8050,
           8000, 6050,
           6000, 9000,
           8500, 7800,
           7000, 6000,
           10000, 9600,
           8000, 7883,
           6686]
dates = [pd.to_datetime(d) for d in dates]
plt.scatter(dates, numbers, s =100, c = 'red')
plt.show()
But when there are one or more extreme numbers, for example if the last number 6686 becomes 66860, the new y-axis range squeezes most of the points together so that their differences are no longer visible.
What is a good way to keep the scatter plot as before (with the original y-axis) and still visualize the extreme numbers?
The purpose of the chart is to show and focus on the distribution of the points under 10000, while also indicating that extreme numbers exist.

If you don't want to use a log scale, you can break the plot in two (or more) and plot the values below/above a threshold:
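df = pd.DataFrame({'num': numbers}, index=dates)
thresh = 12000
f, (ax1, ax2) = plt.subplots(nrows=2, sharex=True,
                             gridspec_kw={'height_ratios': (1, 3)},
                             figsize=(10, 4))
# values at or above the threshold become NaN in lows, values below it become NaN in highs
lows = df['num'].mask(df['num'].ge(thresh))
highs = df['num'].mask(df['num'].lt(thresh))
# the larger lower panel shows the bulk of the data, the smaller upper panel the extremes
ax2.scatter(df.index, lows)
ax1.scatter(df.index, highs)
plt.show()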
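For comparison, the log-scale alternative mentioned at the start of this answer could look like the sketch below. It is only an illustration on a shortened, made-up subset of the data in which the last value has been changed to 66860. Note that a log axis compresses the differences among the values below 10000, which is one reason to prefer the split plot when those differences are the focus.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# shortened sample; the last value is the extreme number from the question
dates = pd.to_datetime(["2021-04-02", "2021-04-12", "2021-04-22",
                        "2021-04-26", "2021-04-30"])
numbers = [10000, 9600, 8000, 7883, 66860]

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(dates, numbers, s=100, c="red")
# a logarithmic y-axis keeps the extreme value on the same axes as the rest
ax.set_yscale("log")
fig.autofmt_xdate()
```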

Create multiple boxplots from statistics in one graph

I am having trouble finding a solution to plot multiple boxplots created from statistics into one graph.
From another application, I get a Dataframe that contains the different metrics needed to draw boxplots (median, quantile 1, ...). While I am able to plot a single boxplot from these statistics with the following code:
data = pd.read_excel("data.xlsx")
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
row = data.iloc[:, 0]
stats = [{
    "label": i,  # not required
    "mean": row["sharpeRatio"],  # not required
    "med": row["sharpeRatio_med"],
    "q1": row["sharpeRatio_q1"],
    "q3": row["sharpeRatio_q3"],
    # "cilo": 5.3 # not required
    # "cihi": 5.7 # not required
    "whislo": row["sharpeRatio_min"],  # required
    "whishi": row["sharpeRatio_max"],  # required
    "fliers": []  # required if showfliers=True
}]
axes.bxp(stats)
plt.show()
I am struggling to create a graph containing boxplots from all the rows in the dataframe. Do you have an idea how to achieve this?

You can pass a list of dictionaries to the bxp method. The easiest way to get such a list from your existing code is to put the dictionary construction inside a function and call it for each row of the dataframe.
Note that data.iloc[:, 0] would be the first column, not the first row.
import matplotlib.pyplot as plt
import pandas as pd
def stats(row):
    # build the statistics dictionary that bxp expects for one box
    return {"med": row["sharpeRatio_med"],
            "q1": row["sharpeRatio_q1"],
            "q3": row["sharpeRatio_q3"],
            "whislo": row["sharpeRatio_min"],
            "whishi": row["sharpeRatio_max"]}
data = pd.DataFrame({"sharpeRatio_med": [3, 4, 2],
                     "sharpeRatio_q1": [2, 3, 1],
                     "sharpeRatio_q3": [4, 5, 3],
                     "sharpeRatio_min": [1, 1, 0],
                     "sharpeRatio_max": [5, 6, 4]})
fig, axes = plt.subplots()
# one dictionary per dataframe row gives one box per row
axes.bxp([stats(data.iloc[i, :]) for i in range(len(data))],
         showfliers=False)
plt.show()
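If each box should also carry a name on the x-axis, the optional "label" key from the question can be filled in per row. The following is a minimal sketch under some assumptions: the index values "A", "B" and "C" are made-up row labels, and the columns dictionary simply maps the bxp keys to the column names used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({"sharpeRatio_med": [3, 4, 2],
                     "sharpeRatio_q1": [2, 3, 1],
                     "sharpeRatio_q3": [4, 5, 3],
                     "sharpeRatio_min": [1, 1, 0],
                     "sharpeRatio_max": [5, 6, 4]},
                    index=["A", "B", "C"])  # hypothetical row labels

# map each bxp key to the dataframe column that holds it
columns = {"med": "sharpeRatio_med", "q1": "sharpeRatio_q1",
           "q3": "sharpeRatio_q3", "whislo": "sharpeRatio_min",
           "whishi": "sharpeRatio_max"}

box_stats = []
for label, row in data.iterrows():
    entry = {key: row[col] for key, col in columns.items()}
    entry["label"] = label  # used as the tick label for this box
    box_stats.append(entry)

fig, ax = plt.subplots()
artists = ax.bxp(box_stats, showfliers=False)
```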

Plotly clustered heatmap (with dendrogram)/Python

I am trying to create a clustered heatmap (with a dendrogram) using plotly in Python. The example on the plotly website does not scale well. I have come across various solutions, but most of them are in R or JavaScript. I want a heatmap with a dendrogram on the left side only, showing the clusters from the hierarchical clustering along the y-axis. A really good-looking example is this one: https://chart-studio.plotly.com/~jackp/6748. My goal is to create something like it, but with only the left-side dendrogram. If someone can implement something like this in Python, I will be really grateful!
Let the data be X = np.random.randint(0, 10, size=(120, 10))

The following suggestion draws on elements from both Dendrograms in Python and chart-studio.plotly.com/~jackp. This particular plot uses your data X = np.random.randint(0, 10, size=(120, 10)). One thing the linked approaches had in common was, in my opinion, that the datasets and data munging procedures were a bit messy. So I decided to build the figure on a pandas dataframe with df = pd.DataFrame(X) to hopefully make everything a bit clearer. The upper dendrogram is only used to initialize the figure and is then hidden, so that only the left-side dendrogram remains visible.
Complete code
import plotly.graph_objects as go
import plotly.figure_factory as ff
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
import random
import string
X = np.random.randint(0, 10, size=(120, 10))
df = pd.DataFrame(X)
# Initialize figure by creating upper dendrogram
fig = ff.create_dendrogram(df.values, orientation='bottom')
fig.for_each_trace(lambda trace: trace.update(visible=False))
for i in range(len(fig['data'])):
    fig['data'][i]['yaxis'] = 'y2'
# Create Side Dendrogram
# dendro_side = ff.create_dendrogram(X, orientation='right', labels = labels)
dendro_side = ff.create_dendrogram(X, orientation='right')
for i in range(len(dendro_side['data'])):
    dendro_side['data'][i]['xaxis'] = 'x2'
# Add Side Dendrogram Data to Figure
for data in dendro_side['data']:
    fig.add_trace(data)
# Create Heatmap
dendro_leaves = dendro_side['layout']['yaxis']['ticktext']
dendro_leaves = list(map(int, dendro_leaves))
data_dist = pdist(df.values)
heat_data = squareform(data_dist)
heat_data = heat_data[dendro_leaves,:]
heat_data = heat_data[:,dendro_leaves]
heatmap = [
    go.Heatmap(
        x = dendro_leaves,
        y = dendro_leaves,
        z = heat_data,
        colorscale = 'Blues'
    )
]
heatmap[0]['x'] = fig['layout']['xaxis']['tickvals']
heatmap[0]['y'] = dendro_side['layout']['yaxis']['tickvals']
# Add Heatmap Data to Figure
for data in heatmap:
    fig.add_trace(data)
# Edit Layout
fig.update_layout({'width': 800, 'height': 800,
                   'showlegend': False, 'hovermode': 'closest',
                   })
# Edit xaxis
fig.update_layout(xaxis={'domain': [.15, 1],
                         'mirror': False,
                         'showgrid': False,
                         'showline': False,
                         'zeroline': False,
                         'ticks': ""})
# Edit xaxis2
fig.update_layout(xaxis2={'domain': [0, .15],
                          'mirror': False,
                          'showgrid': False,
                          'showline': False,
                          'zeroline': False,
                          'showticklabels': False,
                          'ticks': ""})
# Edit yaxis
fig.update_layout(yaxis={'domain': [0, 1],
                         'mirror': False,
                         'showgrid': False,
                         'showline': False,
                         'zeroline': False,
                         'showticklabels': False,
                         'ticks': ""
                         })
# Edit yaxis2
fig.update_layout(yaxis2={'domain': [.825, .975],
                          'mirror': False,
                          'showgrid': False,
                          'showline': False,
                          'zeroline': False,
                          'showticklabels': False,
                          'ticks': ""})
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)",
                  plot_bgcolor="rgba(0,0,0,0)",
                  xaxis_tickfont=dict(color='rgba(0,0,0,0)'))
fig.show()

The simplest solution to this problem is to use the Clustergram function from the dash_bio package.
import numpy as np
import dash_bio as dashbio
X = np.random.randint(0, 10, size=(120, 10))
dashbio.Clustergram(
    data=X,
    # row_labels=rows,
    # column_labels=columns,
    cluster='row',
    color_threshold={
        'row': 250,
        'col': 700
    },
    height=800,
    width=700,
    color_map=[
        [0.0, '#636EFA'],
        [0.25, '#AB63FA'],
        [0.5, '#FFFFFF'],
        [0.75, '#E763FA'],
        [1.0, '#EF553B']
    ]
)
A more laborious solution is to combine plotly.figure_factory.create_dendrogram with plotly.graph_objects.Heatmap, as in the plotly documentation.
The example there is not a dendrogram heat map but rather a pairwise distance heat map; you can still use the two functions to create a dendrogram heat map, though.

You can also use seaborn's clustermap:
https://seaborn.pydata.org/generated/seaborn.clustermap.html
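A minimal sketch of the clustermap route, assuming seaborn and scipy are installed: passing col_cluster=False clusters the rows only, so the dendrogram is drawn on the left side alone. The colormap and figure size are arbitrary choices, and the result is a static matplotlib figure rather than an interactive plotly one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(120, 10))

# cluster the rows only, so that only the left-side dendrogram is drawn
g = sns.clustermap(X, col_cluster=False, cmap="Blues", figsize=(6, 10))

# indices of the original rows in the order used by the heatmap
row_order = g.dendrogram_row.reordered_ind
```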

Plot point on time series line graph

I have this dataframe and have drawn it as a line plot with the following code:
fig, ax = plt.subplots(figsize=(15, 5))
date_time = pd.to_datetime(df.Date)
df = df.set_index(date_time)
plt.xticks(rotation=90)
pd.DataFrame(df, columns=df.columns).plot.line(ax=ax,
                                               xticks=pd.to_datetime(frame.Date))
I want a marker on the Open and Close lines at every date where innovationScore is not 0, labelled with its value. The marker should show where the innovationScore changes.

You have to address two problems - marking the corresponding spots on your curves and using the dates on the x-axis. The first problem can be solved by identifying the dates, where the score is not zero, then plotting markers on top of the curve at these dates. The second problem is more of a structural nature - pandas often interferes with matplotlib when it comes to date time objects. Using pandas standard plotting functions is good because it addresses common problems. But mixing pandas with matplotlib plotting (and to set the markers, you have to revert to matplotlib afaik) is usually a bad idea because they do not necessarily present the date time in the same format.
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation, the following code block is just for illustration
import numpy as np
np.random.seed(1234)
n = 50
date_range = pd.date_range("20180101", periods=n, freq="D")
choice = np.zeros(10)
choice[0] = 3
df = pd.DataFrame({"Date": date_range,
                   "Open": np.random.randint(100, 150, n),
                   "Close": np.random.randint(100, 150, n),
                   "Innovation Score": np.random.choice(choice, n)})
fig, ax = plt.subplots()
#plot the three curves
l = ax.plot(df["Date"], df[["Open", "Close", "Innovation Score"]])
ax.legend(iter(l), ["Open", "Close", "Innovation Score"])
#filter dataset for score not zero
IS = df[df["Innovation Score"] > 0]
#plot markers on these positions
ax.plot(IS["Date"], IS[["Open", "Close"]], "ro")
#and/or set vertical lines to indicate the position
ax.vlines(IS["Date"], 0, max(df[["Open", "Close"]].max()), ls="--")
#label x-axis score not zero
ax.set_xticks(IS["Date"])
#beautify the output
ax.set_xlabel("Month")
ax.set_ylabel("Artificial score people take seriously")
fig.autofmt_xdate()
plt.show()

You can achieve it like this:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], "ro-")  # r is red, o is a circle marker, - is a solid line
plt.plot([3, 2, 1], "b.-")  # b is blue, . is a point marker, - is a solid line
plt.show()
Check out also example here for another approach:
https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/markevery_prop_cycle.html
I very often get inspiration from this list of examples:
https://matplotlib.org/3.3.3/gallery/index.html
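The markevery argument from the linked example can be applied to this question directly. The sketch below makes some assumptions: the data are fabricated (a non-zero score on every tenth day), and markevery receives the integer positions of the rows whose score is not zero, so only those points get a marker on each line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# fabricated data: the score is non-zero on every tenth day
n = 50
rng = np.random.default_rng(1234)
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=n, freq="D"),
    "Open": rng.integers(100, 150, n),
    "Close": rng.integers(100, 150, n),
    "Innovation Score": np.where(np.arange(n) % 10 == 0, 3.0, 0.0),
})

# integer positions of the rows whose score is not zero
marked = np.flatnonzero(df["Innovation Score"].to_numpy() != 0).tolist()

fig, ax = plt.subplots()
for column in ["Open", "Close"]:
    # markevery restricts the markers to the listed positions
    ax.plot(df["Date"], df[column], marker="o", markevery=marked, label=column)
ax.legend()
fig.autofmt_xdate()
```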

Add axhline to legend

I'm creating a lineplot from a dataframe with seaborn and I want to add a horizontal line to the plot. That works fine, but I am having trouble adding the horizontal line to the legend.
Here is a minimal, verifiable example:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.array([2, 2, 4, 4])
y = np.array([5, 10, 10, 15])
isBool = np.array([True, False, True, False])
data = pd.DataFrame(np.column_stack((x, y, isBool)), columns=["x", "y", "someBoolean"])
print(data)
ax = sns.lineplot(x="x", y="y", hue="someBoolean", data=data)
plt.axhline(y=7, c='red', linestyle='dashed', label="horizontal")
plt.legend(("some name", "some other name", "horizontal"))
plt.show()
This results in the following plot:
The legends for "some name" and "some other name" show up correctly, but the "horizontal" legend is just blank. I tried simply using plt.legend() but then the legend consists of seemingly random values from the dataset.
Any ideas?

Simply using plt.legend() shows you what data is being plotted:
You are using someBoolean as the hue. So you are essentially creating two lines by applying a Boolean mask to your data. One line is for values that are False (shown as 0 on the legend above), the other for values that are True (shown as 1 on the legend above).
In order to get the legend you want you need to set the handles and the labels. You can get a list of them using ax.get_legend_handles_labels(). Then make sure to omit the first handle which, as shown above, has no artist:
ax = sns.lineplot(x="x", y="y", hue="someBoolean", data=data)
plt.axhline(y=7, c='red', linestyle='dashed', label="horizontal")
labels = ["some name", "some other name", "horizontal"]
handles, _ = ax.get_legend_handles_labels()
# Slice list to remove first handle
plt.legend(handles = handles[1:], labels = labels)
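Whether the first handle is an empty title entry may depend on the seaborn version, so slicing the handle list can be fragile. A version-independent sketch, which swaps seaborn for plain matplotlib, gives every line its label when it is drawn and lets the legend collect them. The names dictionary, which maps the Boolean values to the desired legend entries, is an assumption about which name belongs to which group.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({"x": [2, 2, 4, 4],
                     "y": [5, 10, 10, 15],
                     "someBoolean": [True, False, True, False]})
# assumed mapping from group value to legend entry
names = {False: "some name", True: "some other name"}

fig, ax = plt.subplots()
# groupby sorts its keys, so the False group is drawn first
for flag, group in data.groupby("someBoolean"):
    ax.plot(group["x"], group["y"], label=names[bool(flag)])
ax.axhline(y=7, c="red", linestyle="dashed", label="horizontal")
# every labelled artist, including the horizontal line, ends up in the legend
legend = ax.legend()
```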
