plot ellipse in a seaborn scatter plot - python

I have a data frame in pandas format (pd.DataFrame) with columns = [z1,z2,Digit], and I did a scatter plot in seaborn:
dataframe = dataFrame.apply(pd.to_numeric, errors='coerce')
sns.lmplot("z1", "z2", data=dataframe, hue='Digit', fit_reg=False, size=10)
plt.show()
What I want to do is plot an ellipse around each of these points, but I can't seem to plot an ellipse in the same figure.
I know the normal way to plot an ellipse is like:
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
elps = Ellipse((0, 0), 4, 2,edgecolor='b',facecolor='none')
a = plt.subplot(111, aspect='equal')
a.add_artist(elps)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.show()
But because I have to do "a = plt.subplot(111, aspect='equal')", the plot will be on a different figure. And I also can't do:
a = sns.lmplot("z1", "z2", data=rect, hue='Digit', fit_reg=False, size=10)
a.add_artist(elps)
because the 'a' returned by sns.lmplot() is a "seaborn.axisgrid.FacetGrid" object. Any solutions? Is there any way I can plot an ellipse without having to do something like a.add_artist()?

Seaborn's lmplot() uses a FacetGrid object to do the plot, and therefore your variable a = sns.lmplot(...) is a reference to that FacetGrid object.
To add your ellipse, you need a reference to the Axes object. The problem is that a FacetGrid can contain multiple axes depending on how you split your data. Thankfully, there is a method FacetGrid.facet_axis(row_i, col_j) which returns a reference to a specific Axes object.
In your case, you would do:
a = sns.lmplot("z1", "z2", data=rect, hue='Digit', fit_reg=False, size=10)
ax = a.facet_axis(0,0)
ax.add_artist(elps)
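As a minimal, self-contained sketch of the same idea, the snippet below draws one small ellipse around every point. The random data, the ellipse size (0.4 by 0.2) and the Agg backend are assumptions made for illustration only. Note that recent seaborn releases expect x and y as keyword arguments and use height instead of size, so the call is written that way here.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.patches import Ellipse

# Made-up stand-in for the question's dataframe (columns z1, z2, Digit)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "z1": rng.normal(0, 1, 30),
    "z2": rng.normal(0, 1, 30),
    "Digit": rng.integers(0, 3, 30),
})

# lmplot returns a FacetGrid, not an Axes
g = sns.lmplot(data=df, x="z1", y="z2", hue="Digit", fit_reg=False, height=5)

# Fetch the single Axes of the grid (row 0, column 0)
ax = g.facet_axis(0, 0)

# Add one ellipse per data point; width and height are arbitrary here
for x, y in zip(df["z1"], df["z2"]):
    ax.add_patch(Ellipse((x, y), width=0.4, height=0.2,
                         edgecolor="b", facecolor="none"))

g.savefig("ellipses.png")
```

With a single facet, g.axes[0, 0] refers to the same Axes object as g.facet_axis(0, 0).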

Related

Overlaying Pandas plot with Matplotlib is sensitive to the plotting order

I have the following problem: I'm trying to overlay two plots: one Pandas plot via plot.area() for a dataframe, and a second plot that is a standard Matplotlib plot. Depending on the order of the code for those two, the Matplotlib plot is displayed only if its code comes before the Pandas plot.area() on the same axes.
Example: I have a Pandas dataframe called revenue that has a DateTimeIndex, and a single column with "revenue" values (float). Separately I have a dataset called projection with data along the same index (revenue.index)
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlayed properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain to me what I'm doing wrong, or what I need to do to make the code more robust? TIA.
The values on the x-axis are different in both plots. I think DataFrame.plot.area() formats the DateTimeIndex in a pretty way, which is not compatible with pyplot.plot().
If you plot the projection first, plot.area() can still plot the data and does not reformat the x-axis.
Mixing the two seems tricky to me, so I would either use pyplot or Dataframe.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)
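The point about differing x-axis values can be checked directly by comparing the axis limits that each approach produces. The sketch below uses made-up monthly data on two separate axes; the variable names are illustrative and the Agg backend is an assumption so that it runs without a display. With a regular DatetimeIndex, pandas typically plots against period ordinals, while matplotlib uses its own date numbers, so the two ranges do not overlap.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly data with a regular DatetimeIndex
index = pd.date_range("2021-12-01", periods=4, freq="MS")
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]}, index=index)
projection = [1000, 2000, 3000, 4000]

fig, (ax_pandas, ax_mpl) = plt.subplots(1, 2, figsize=(10, 4))

# Pandas decides how the datetime index is mapped onto the x-axis
revenue.plot.area(ax=ax_pandas)

# Matplotlib maps the same timestamps with its own date converter
ax_mpl.plot(revenue.index, projection, color="black", linewidth=3)

pandas_xlim = ax_pandas.get_xlim()
mpl_xlim = ax_mpl.get_xlim()
print("pandas area plot x-limits:", pandas_xlim)
print("matplotlib line x-limits: ", mpl_xlim)
```

If the two printed ranges are far apart, a line drawn with ax.plot() after plot.area() lands outside the visible x-range, which matches the behaviour described in the question.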

set custom tick labels on heatmap color bar

I have a list of dataframes named merged_dfs that I am looping through to get the correlation and plot subplots of heatmap correlation matrix using seaborn.
I want to customize the colorbar tick labels, but I am having trouble figuring out how to do it with my example.
Currently, my colorbar scale values from top to bottom are
[1,0.5,0,-0.5,-1]
I want to keep these values, but change the tick labels to be
[1,0.5,0,0.5,1]
for my diverging color bar.
Here is the code and my attempt:
fig, ax = plt.subplots(nrows=6, ncols=2, figsize=(20,20))
for i, (title, merging) in enumerate(zip(new_name_data, merged_dfs)):
    graph = merging.corr()
    colormap = sns.diverging_palette(250, 250, as_cmap=True)
    a = sns.heatmap(graph.abs(), cmap=colormap, vmin=-1, vmax=1, center=0, annot=graph, ax=ax.flat[i])
    cbar = fig.colorbar(a)
    cbar.set_ticklabels(["1","0.5","0","0.5","1"])
fig.delaxes(ax[5,1])
plt.show()
plt.close()
I keep getting this error:
AttributeError: 'AxesSubplot' object has no attribute 'get_array'
Several things are going wrong:
fig.colorbar(...) would create a new colorbar, by default appended to the last subplot that was created.
sns.heatmap returns an ax (indicates a subplot). This is very different to matplotlib functions, e.g. plt.imshow(), which would return the graphical element that was plotted.
You can suppress the heatmap's colorbar (cbar=False), and then create it newly with the parameters you want.
fig.colorbar(...) needs a parameter ax=... when the figure contains more than one subplot.
Instead of creating a new colorbar, you can pass the colorbar parameters to sns.heatmap via cbar_kws=.... The colorbar itself can be found via ax.collections[0].colorbar. (ax.collections[0] is where matplotlib stores the graphical object that contains the heatmap.)
Looping with an explicit index is discouraged in Python. It is usually more readable, easier to maintain and less error-prone to put everything into the zip command.
Now that vmin is -1, taking the absolute value for the coloring seems to be a mistake.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
merged_dfs = [pd.DataFrame(data=np.random.rand(5, 7), columns=[*'ABCDEFG']) for _ in range(5)]
new_name_data = [f'Dataset {i + 1}' for i in range(len(merged_dfs))]
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 7))
# Tick positions on the colorbar; the labels set below show their absolute values
ticks = [-1, -0.5, 0, 0.5, 1]
for title, merging, ax in zip(new_name_data, merged_dfs, axes.flat):
    graph = merging.corr()
    colormap = sns.diverging_palette(250, 250, as_cmap=True)
    sns.heatmap(graph, cmap=colormap, vmin=-1, vmax=1, center=0, annot=True, ax=ax, cbar_kws={'ticks': ticks})
    ax.collections[0].colorbar.set_ticklabels([abs(t) for t in ticks])
fig.delaxes(axes.flat[-1])
fig.tight_layout()
plt.show()

Plotting two dataframes obtained from a loop in the same graph Python

I would like to plot two dfs with two different colors. For each df, I would need to add two markers. Here is what I have tried:
for stats_file in stats_files:
    data = Graph(stats_file)
    Graph.compute(data)
    data.servers_df.plot(x="time", y="percentage", linewidth=1, kind='line')
    plt.plot(data.first_measurement['time'], data.first_measurement['percentage'], 'o-', color='orange')
    plt.plot(data.second_measurement['time'], data.second_measurement['percentage'], 'o-', color='green')
plt.show()
Using this piece of code, I get the servers_df plotted with markers, but on separate graphs.
How I can have both graphs in a single one to compare them better?
Thanks.
TL;DR
Your call to data.servers_df.plot() always creates a new plot, and plt.plot() plots on the latest plot that was created. The solution is to create one dedicated Axes for everything to plot onto.
Preface
I assumed your variables are the following
data.servers_df: Dataframe with two float columns "time" and "percentage"
data.first_measurement: A dictionary with keys "time" and "percentage", each of which is a list of floats
data.second_measurement: A dictionary with keys "time" and "percentage", each of which is a list of floats
I skipped generating stat_files as you did not show what Graph() does, and just created a list of dummy data.
If data.first_measurement and data.second_measurement are also dataframes, let me know and there is an even nicer solution.
Theory - Behind the curtains
Each matplotlib plot (line, bar, etc.) lives on a matplotlib.axes.Axes element. These are like regular axes of a coordinate system. Now two things happen here:
When you use plt.plot(), no axes are specified, so matplotlib looks up the current axes element (in the background). If there is none, it creates an empty one, uses it, and sets it as the default. The second call to plt.plot() then finds these axes and uses them.
DataFrame.plot() on the other hand, always creates a new axes element if none is given to it (possible through the ax argument)
So in your code, data.servers_df.plot() first creates an axes element behind the curtains (which is then the default), and the two following plt.plot() calls get the default axes and plot onto it - which is why you get two plots instead of one.
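The behaviour described above can be observed with a small experiment. This is only a sketch with a made-up dataframe (the column names mirror the question, and the Agg backend is an assumption so that it runs without a display): it counts how many figures exist after two DataFrame.plot() calls without ax, shows which axes a bare plt.plot() lands on, and then repeats the plotting on one shared Axes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"time": range(5), "percentage": [1, 3, 2, 5, 4]})

figures_before = len(plt.get_fignums())

# Without ax=..., each DataFrame.plot() call opens its own figure
ax_a = df.plot(x="time", y="percentage")
ax_b = df.plot(x="time", y="percentage")
figures_created = len(plt.get_fignums()) - figures_before

# plt.plot() draws on the current axes, i.e. the most recently created ones
plt.plot([0, 4], [0, 5], "o")
print("figures created:", figures_created)
print("lines on first axes:", len(ax_a.lines), "lines on second axes:", len(ax_b.lines))

# With one explicit Axes, everything ends up in the same plot
fig, shared_ax = plt.subplots()
df.plot(x="time", y="percentage", ax=shared_ax)
df.plot(x="time", y="percentage", ax=shared_ax)
shared_ax.plot([0, 4], [0, 5], "o")
print("lines on shared axes:", len(shared_ax.lines))
```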
Solution
The following solution first creates a dedicated matplotlib.axes.Axes using plt.subplots(). All lines are then drawn onto this Axes. Note especially the ax=ax in data.servers_df.plot(). Note also that I changed the display of your markers from o- to o (as we don't want to display a line (-) but only markers (o)).
Mock data can be found below
fig, ax = plt.subplots() # Here we create the axes that all data will plot onto
for i, data in enumerate(stat_files):
    y_column = f'percentage_{i}' # Make the columns identifiable
    data.servers_df \
        .rename(columns={'percentage': y_column}) \
        .plot(x='time', y=y_column, linewidth=1, kind='line', ax=ax)
    ax.plot(data.first_measurement['time'], data.first_measurement['percentage'], 'o', color='orange')
    ax.plot(data.second_measurement['time'], data.second_measurement['percentage'], 'o', color='green')
plt.show()
Mock data
import random
import pandas as pd
import matplotlib.pyplot as plt
# Generation of dummy data
random.seed(1)
NUMBER_OF_DATA_FILES = 2
X_LENGTH = 10
class Data:
    def __init__(self):
        self.servers_df = pd.DataFrame(
            {
                'time': range(X_LENGTH),
                'percentage': [random.randint(0, 10) for _ in range(X_LENGTH)]
            }
        )
        self.first_measurement = {
            'time': self.servers_df['time'].values[:X_LENGTH // 2],
            'percentage': self.servers_df['percentage'].values[:X_LENGTH // 2]
        }
        self.second_measurement = {
            'time': self.servers_df['time'].values[X_LENGTH // 2:],
            'percentage': self.servers_df['percentage'].values[X_LENGTH // 2:]
        }
stat_files = [Data() for _ in range(NUMBER_OF_DATA_FILES)]
DataFrame.plot() by default returns a matplotlib.axes.Axes object. You should then plot the other two plots on this object:
for stats_file in stats_files:
    data = Graph(stats_file)
    Graph.compute(data)
    ax = data.servers_df.plot(x="time", y="percentage", linewidth=1, kind='line')
    ax.plot(data.first_measurement['time'], data.first_measurement['percentage'], 'o-', color='orange')
    ax.plot(data.second_measurement['time'], data.second_measurement['percentage'], 'o-', color='green')
plt.show()
If you want to plot them one on top of the others with different colors you can do something like this:
colors = ['C0', 'C1', 'C2'] # matplotlib default color palette
# assuming that len(stats_files) = 3
# if not you need to specify as many colors as necessary
ax = plt.subplot(111)
for stats_file, c in zip(stats_files, colors):
    data = Graph(stats_file)
    Graph.compute(data)
    # color=c gives each servers_df line its own color
    data.servers_df.plot(x="time", y="percentage", linewidth=1, kind='line', ax=ax, color=c)
    ax.plot(data.first_measurement['time'], data.first_measurement['percentage'], 'o-', color='orange')
    ax.plot(data.second_measurement['time'], data.second_measurement['percentage'], 'o-', color='green')
plt.show()
This just changes the color of the servers_df.plot. If you want to change the color of the other two, you can apply the same logic: create a list of the colors that you want them to take at each iteration, iterate over that list and pass the color value to the color param at each iteration.
You can create an Axes object for plotting in the first place, for example
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df_one = pd.DataFrame({'a':np.linspace(1,10,10),'b':np.linspace(1,10,10)})
df_two = pd.DataFrame({'a':np.random.randint(0,20,10),'b':np.random.randint(0,5,10)})
dfs = [df_one,df_two]
fig,ax = plt.subplots(figsize=(8,6))
colors = ['navy','darkviolet']
markers = ['x','o']
for ind, item in enumerate(dfs):
    ax.plot(item['a'], item['b'], c=colors[ind], marker=markers[ind])
As you can see, the two dataframes are plotted on the same ax with different colors and markers.
You need to create the plot before.
Afterwards, you can explicitly refer to this plot while plotting the graphs.
df.plot(..., ax=ax) or ax.plot(x, y)
import matplotlib.pyplot as plt
(fig, ax) = plt.subplots(figsize=(20,5))
for stats_file in stats_files:
    data = Graph(stats_file)
    Graph.compute(data)
    data.servers_df.plot(x="time", y="percentage", linewidth=1, kind='line', ax=ax)
    ax.plot(data.first_measurement['time'], data.first_measurement['percentage'], 'o-', color='orange')
    ax.plot(data.second_measurement['time'], data.second_measurement['percentage'], 'o-', color='green')
plt.show()

Superimposing plots in seaborn causes x-axis to misalign

I am having an issue trying to superimpose plots with seaborn. I am able to generate the two plots separately as
fig, (ax1,ax2) = plt.subplots(ncols=2,figsize=(30, 7))
sns.lineplot(data=data1, y='MSE',x='pct_gc',ax=ax1)
sns.boxplot(x="pct_gc", y="MSE", data=data2,ax=ax2,width=0.4)
The output looks like this:
But the problem appears when I try to superimpose both plots by assigning both to the same ax object:
fig, ax1 = plt.subplots(figsize=(30, 7))
sns.lineplot(data=data1, y='MSE',x='pct_gc',ax=ax1)
sns.boxplot(x="pct_gc", y="MSE", data=data2,ax=ax1,width=0.4)
I am not able to identify why the x-axis of the line plot changes when superimposing both plots (the x-axes of both plots go from 0 to 0.069).
My goal is for both plots to be superimposed, while keeping the same X axis range.
Seaborn's boxplot creates a categorical x-axis, with all boxes spaced at the same distance. Internally the x-axis is numbered as 0, 1, 2, ... but externally it gets the labels from 0 to 0.069.
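The internal numbering can be inspected by reading the tick positions back from the axes. The following is a small sketch with made-up values for pct_gc and MSE (only the column names come from the question; the Agg backend is an assumption so that it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: three distinct pct_gc values, three MSE samples each
data2 = pd.DataFrame({
    "pct_gc": [0.01, 0.01, 0.01, 0.03, 0.03, 0.03, 0.069, 0.069, 0.069],
    "MSE": [1.0, 1.2, 0.9, 2.0, 2.1, 1.8, 3.0, 3.3, 2.9],
})

fig, ax = plt.subplots()
sns.boxplot(data=data2, x="pct_gc", y="MSE", ax=ax, width=0.4)

# Positions at which the boxes are actually drawn on the x-axis
tick_positions = [float(t) for t in ax.get_xticks()]
print(tick_positions)
```

The boxes sit at evenly spaced integer positions, not at the numeric pct_gc values, so a line plot drawn against the real pct_gc values on the same axes uses a different x scale.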
To combine a line plot with a boxplot, matplotlib's boxplot can be called directly, so that positions and widths can be set explicitly. When patch_artist=True, a rectangle is created (instead of just lines), for which a facecolor can be given. manage_ticks=False prevents boxplot from changing the x ticks and their limits. Optionally, notch=True would accentuate the median a bit more, but depending on the data, the confidence interval might be too large and look weird.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
data1 = pd.DataFrame({'pct_gc': np.linspace(0, 0.069, 200), 'MSE': np.random.normal(0.02, 0.1, 200).cumsum()})
data1['pct_range'] = pd.cut(data1['pct_gc'], 10)
fig, ax1 = plt.subplots(ncols=1, figsize=(20, 7))
sns.lineplot(data=data1, y='MSE', x='pct_gc', ax=ax1)
for interval, color in zip(np.unique(data1['pct_range']), plt.cm.tab10.colors):
    ax1.boxplot(data1[data1['pct_range'] == interval]['MSE'],
                positions=[interval.mid], widths=0.4 * interval.length,
                patch_artist=True, boxprops={'facecolor': color},
                notch=False, medianprops={'color':'yellow', 'linewidth':2},
                manage_ticks=False)
plt.show()

How do I overlay multiple plot types (bar + scatter) in one figure, sharing x-axis

I am attempting to overlay two graphs, a bar graph and a scatter plot, that share an x-axis but have separate y-axes on either side of the graph. I have tried using matplotlib, ggplot, and seaborn, but I am having the same problem with all of them. I can graph them both separately, and they graph correctly, but when I try to graph them together, the bar graph is correct, but only a couple of data points from the scatter plot show up. I have zoomed in and can confirm that almost none of the scatter-plot points are appearing.
Here is my code. I have loaded a pandas dataframe and am trying to graph 'dKO_Log2FC' as a bar graph, and 'TTCAAG' as a scatter plot. They both share the 'bin_end' position on the x-axis. If I comment out sns.barplot, the scatter plot graphs perfectly. If I comment out sns.scatterplot, the bar plot graphs as well. When I graph them together without commenting out either, the bar graph graphs, but only two data points from the 'TTCAAG' column show up. I have played with the size of the scatter dots, zoomed in, etc., but nothing is working.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
file = pd.read_csv('path/to/csv_file.csv')
df = pd.DataFrame(file, columns=['bin_end', 'TTCAAG', 'dKO_Log2FC'])
bin_end = df['bin_end']
TTCAAG = df['TTCAAG']
dKO_Log2FC = df['dKO_Log2FC']
fig, ax = plt.subplots()
ax2 = ax.twinx()
sns.barplot(x=bin_end, y=dKO_Log2FC, ax=ax, color="blue", data=df)
sns.scatterplot(x=bin_end, y=TTCAAG, ax=ax2, color="red", data=df)
plt.title('Histone Position in TS559 vs dKO')
plt.xlabel('Genomic Position (Bin = 1000nt)', fontsize=10)
plt.xticks([])
plt.ylabel('Log2 Fold Change', fontsize=10)
plt.show()
I have no idea why the scatter plot won't graph completely. The dataset is quite large, but even when I break it up into smaller bits, only a few scatter points show up.
Here are the graphs
I'm not sure what the problem is. I think it is related to the amount of data or some other data-related issue. However, since you can plot the data separately, you can generate an image for each plot and then blend the two images to get the required plot.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from PIL import Image
npoints=200
xRange=np.arange(0,npoints,1)
randomdata0=np.abs(np.random.normal(0,1,npoints))
randomdata1=np.random.normal(10,1,npoints)
axtick=[7,10,14]
ax2tick=[0,1.5,3]
fig0=plt.figure(0)
ax=fig0.gca()
ax2=ax.twinx()
sns.scatterplot(x=xRange,y=randomdata1,ax=ax)
ax.set_yticks(axtick)
ax.set_ylim([6,15])
ax2.set_yticks(ax2tick)
ax2.set_ylim([0,3.5])
plt.xticks([])
canvas0 = FigureCanvas(fig0)
s, (width, height) = canvas0.print_to_buffer()
X0 = Image.frombytes("RGBA", (width, height), s) #Contains the data of the first plot
fig1=plt.figure(1)
ax=fig1.gca()
ax2=ax.twinx()
sns.barplot(x=xRange,y=randomdata0,ax=ax2)
ax.set_yticks(axtick)
ax.set_ylim([6,15])
ax2.set_yticks(ax2tick)
ax2.set_ylim([0,3.5])
plt.xticks([])
canvas1 = FigureCanvas(fig1)
s, (width, height) = canvas1.print_to_buffer()
X1 = Image.frombytes("RGBA", (width, height), s) #Contains the data of the second plot
plt.figure(13,figsize=(10,10))
plt.imshow(Image.blend(X0,X1,0.5),interpolation='gaussian')
Axes=plt.gca()
Axes.spines['top'].set_visible(False)
Axes.spines['right'].set_visible(False)
Axes.spines['bottom'].set_visible(False)
Axes.spines['left'].set_visible(False)
Axes.set_xticks([])
Axes.set_yticks([])
Just remember to set the twin axes with the same range and ticks in both plots, otherwise, there will be some shift in the images and the numbers will not align.
Hope it helps
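A different approach, offered as a sketch and not as a confirmed diagnosis of the question's data: sns.barplot treats x as categorical and draws its bars at positions 0, 1, 2, ..., while sns.scatterplot places points at the numeric bin_end values, so on a shared x-axis most points can fall far outside the range covered by the bars. Drawing both layers with plain matplotlib keeps them on the same numeric scale. The data below is made up, the bar width of 800 is an arbitrary choice for bins 1000 apart, and the Agg backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV in the question (same column names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "bin_end": np.arange(1000, 51000, 1000),
    "TTCAAG": rng.integers(0, 10, 50),
    "dKO_Log2FC": rng.normal(0, 1, 50),
})

fig, ax = plt.subplots()
ax2 = ax.twinx()  # second y-axis that shares the x-axis

# Both layers use the numeric bin_end values as x positions
ax.bar(df["bin_end"], df["dKO_Log2FC"], width=800, color="blue")
ax2.scatter(df["bin_end"], df["TTCAAG"], color="red", s=10)

ax.set_title("Histone Position in TS559 vs dKO")
ax.set_xlabel("Genomic Position (Bin = 1000nt)", fontsize=10)
ax.set_ylabel("Log2 Fold Change", fontsize=10)
ax2.set_ylabel("TTCAAG", fontsize=10)
fig.savefig("bar_scatter.png")
```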
