I am trying to create a Manhattan plot that will be vertically highlighted at certain parts of the plot given a list of values corresponding to points in the scatter plot. I looked at several examples but I am not sure how to proceed. I think using axvspan or ax.fill_between should work but I am not sure how. The code below was lifted directly from
How to create a Manhattan plot with matplotlib in python?
from pandas import DataFrame
from scipy.stats import uniform
from scipy.stats import randint
import numpy as np
import matplotlib.pyplot as plt
# some sample data
df = DataFrame({'gene' : ['gene-%i' % i for i in np.arange(10000)],
                'pvalue' : uniform.rvs(size=10000),
                'chromosome' : ['ch-%i' % i for i in randint.rvs(0,12,size=10000)]})
# -log_10(pvalue)
df['minuslog10pvalue'] = -np.log10(df.pvalue)
df.chromosome = df.chromosome.astype('category')
df.chromosome = df.chromosome.cat.set_categories(['ch-%i' % i for i in range(12)], ordered=True)
df = df.sort_values('chromosome')
# How to plot gene vs. -log10(pvalue) and colour it by chromosome?
df['ind'] = range(len(df))
df_grouped = df.groupby(('chromosome'))
fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['red','green','blue', 'yellow']
x_labels = []
x_labels_pos = []
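for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    x_labels.append(name)
    # centre of this chromosome's index range, used as the tick position
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2))
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df)])
ax.set_ylim([0, 3.5])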
Given a list of values (pvalues) for some of the points, e.g.
lst = [0.288686, 0.242591, 0.095959, 3.291343, 1.526353]
How do I highlight the regions containing these points on the plot, as shown in green in the image below?
It would help if you included a sample of your dataframe for reference.
Assuming you want to match your lst values with Y values, you need to iterate through each Y value you're plotting and check if they are within lst.
for num, (name, group) in enumerate(df_grouped):
The group variable in your code is essentially a partial dataframe of your main dataframe, df. Hence, you need another loop to look through all the Y values for matches in lst.
region_plot = []
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    # create a new df to get only rows that have matched values with lst
    # (isin compares floats exactly, so the values in lst must match to full precision)
    temp_group = group[group['minuslog10pvalue'].isin(lst)]
    for x_group in temp_group['ind']:
        # If condition to make sure same region is not highlighted again
        if x_group not in region_plot:
            ax.axvspan(x_group, x_group+1, alpha=0.5, color='green')
            # I put x_group+1 because I'm not sure how big of a highlight range you want
            region_plot.append(x_group)
Hope this helps!
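One caveat with the isin approach: the values in lst look rounded to six decimals, so an exact float comparison may match nothing. Below is a minimal sketch of a tolerance-based match, assuming a small made-up dataset in place of the real dataframe. The helper name matching_positions and the tolerance are my own choices, not something from the question.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt


def matching_positions(x, y, targets, tol=1e-6):
    """Return the x positions whose y value lies within tol of any target (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Compare every y against every target; a point is a hit if any target is close
    hits = np.isclose(y[:, None], targets[None, :], rtol=0.0, atol=tol).any(axis=1)
    return x[hits]


# Made-up stand-in for df['ind'] and df['minuslog10pvalue']
x = np.arange(6)
y = np.array([0.5, 0.288686, 1.2, 3.291343, 0.7, 1.526353])
lst = [0.288686, 3.291343, 1.526353]

fig, ax = plt.subplots()
ax.scatter(x, y)
highlight_x = matching_positions(x, y, lst)
for pos in highlight_x:
    # Shade a band one unit wide, centred on the matching point
    ax.axvspan(pos - 0.5, pos + 0.5, alpha=0.5, color="green")
```

The band width (one index unit here) is arbitrary. Widen it if single points are too thin to see on a plot with 10000 points.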
I am currently building a set of scatter plot charts using pandas plot.scatter. The construction builds on two base axes.
My current construction looks akin to
ax1 = pandas.scatter.plot()
ax2 = pandas.scatter.plot(ax=ax1)
for dataframe in list:
    output_ax = pandas.scatter.plot(ax2)
total_output_ax = total_list.scatter.plot(ax2)
This seems inefficient. For 1...N permutations I want to reuse a base axes that has 50% of the data already plotted. What I am trying to do is:
Add base data to scatter plot
For item x in y: (save data to base scatter and save image)
Add all data to scatter plot and save image
Here's one way to do it with matplotlib's scatter.
I plot column 0 on the x-axis, and all other columns on the y-axis, one at a time.
Notice that there is only 1 ax object, and I don't replot all points, I just add points using the same axes with a for loop.
Each time I get a corresponding png image.
import numpy as np
import pandas as pd
testdf = pd.DataFrame(np.random.rand(20,4))
testdf.head(5) looks like this:
          0         1         2         3
0  0.435995  0.025926  0.549662  0.435322
1  0.420368  0.330335  0.204649  0.619271
2  0.299655  0.266827  0.621134  0.529142
3  0.134580  0.513578  0.184440  0.785335
4  0.853975  0.494237  0.846561  0.079645
# I put the first scatter outside the loop; it could be in the loop as well
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(testdf[0],testdf[1], color='red')
fig.savefig('full_1.png')
colors = ['pink', 'green', 'black', 'blue']
for i in range(2,4):
    ax.scatter(testdf[0], testdf[i], color=colors[i])
    fig.savefig('full_' + str(i) + '.png')
Then you get these 3 images (full_1.png, full_2.png, full_3.png)
Axes objects cannot be simply copied or transferred. However, it is possible to set artists to visible/invisible in a plot. Given your ambiguous question, it is not fully clear how your data are stored but it seems to be a list of dataframes. In any case, the concept can easily be adapted to different input data.
import matplotlib.pyplot as plt
#test data generation
import pandas as pd
import numpy as np
rng = np.random.default_rng(123456)
df_list = [pd.DataFrame(rng.integers(0, 100, (7, 2))) for _ in range(3)]
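#plot all dataframes into an axis object to ensure
#that all plots have the same scaling
fig, ax = plt.subplots()
patch_collections = []
for i, df in enumerate(df_list):
    pc = ax.scatter(x=df[0], y=df[1], label=str(i))
    #hide each scatter for now and keep a reference to it
    pc.set_visible(False)
    patch_collections.append(pc)
#store individual plots
for i, pc in enumerate(patch_collections):
    pc.set_visible(True)
    ax.set_title(f"Dataframe {i}")
    fig.savefig(f"dataframe_{i}.png")  #placeholder file name
    pc.set_visible(False)
#store summary plot
for pc in patch_collections:
    pc.set_visible(True)
ax.set_title("All dataframes")
fig.savefig("all_dataframes.png")  #placeholder file name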
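If the axis limits are fixed up front, another option is to add each extra dataset, save, and then remove that artist again, so the base data is only ever drawn once. This is a rough sketch under that assumption, with random arrays standing in for the real data. It writes the PNGs to in-memory buffers only so that it runs without touching the disk.

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
base = rng.random((20, 2))                       # data shared by every image
extras = [rng.random((5, 2)) for _ in range(3)]  # one extra dataset per image

fig, ax = plt.subplots()
ax.scatter(base[:, 0], base[:, 1], color="grey")
# Fix the limits so every saved image uses the same scale
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

images = []
for extra in extras:
    pc = ax.scatter(extra[:, 0], extra[:, 1], color="red")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")  # use a file name here to write to disk instead
    images.append(buf.getvalue())
    pc.remove()                     # take the extra points off; the base scatter stays
```

Compared with toggling visibility, removing the artist keeps the axes from accumulating hidden collections when N is large.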
The code below achieves what I want to do, but does so in a very roundabout way. I have looked around for a succinct way to produce a single legend for a figure that includes multiple subplots that takes into account their labels, to no avail. plt.figlegend() requires you to pass in labels and lines, and plt.legend() requires only handles (slightly better).
My example below illustrates what I want. I have 9 vectors, each belonging to one of 3 categories. I want to plot each vector on a separate subplot, label it, and plot a legend which indicates (using colour) what the label means; this is the automatic behaviour on a single plot.
Do you know of a better way of achieving the plot below?
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
nr_lines = 9
nr_cats = 3
# Data
X = np.random.randn(nr_lines, 100)
labels = ['Category {}'.format(ii) for ii in range(nr_cats)]
y = np.random.choice(labels, nr_lines)
# Ideally wouldn't have to manually pick colours
clrs = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']
clrs = [clrs[ii] for ii in range(nr_cats)]
lab_clr = {k: v for k, v in zip(labels, clrs)}
fig, ax = plt.subplots(3, 3)
ax = ax.flatten()
for ii in range(nr_lines):
    ax[ii].plot(X[ii,:], label=y[ii], color=lab_clr[y[ii]])
lines = [a.lines[0] for a in ax]
l_labels = [l.get_label() for l in lines]
# the hack - get a single occurrence of each label
idx_list = [l_labels.index(lab) for lab in labels]
lines_ = [lines[idx] for idx in idx_list]
#l_labels_ = [l_labels[idx] for idx in idx_list]
plt.legend(handles=lines_, bbox_to_anchor=[2, 2.5])
You could use a dictionary to collect them using the label as a key. For example:
handles = {}
for ii in range(nr_lines):
    l1, = ax[ii].plot(X[ii,:], label=y[ii], color=lab_clr[y[ii]])
    if y[ii] not in handles:
        handles[y[ii]] = l1
plt.legend(handles=list(handles.values()), bbox_to_anchor=[2, 2.5])
You only add a handle to the dictionary if the category isn't already present.
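The same de-duplication can be done after plotting, by asking each axes for its labelled artists. Here is a minimal sketch, assuming a 2x2 grid and made-up category assignments. It uses get_legend_handles_labels and a figure-level legend instead of bbox_to_anchor offsets.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up category per subplot and a fixed colour per category
categories = ["Category 0", "Category 1", "Category 0", "Category 2"]
colour_of = {"Category 0": "C0", "Category 1": "C1", "Category 2": "C2"}

fig, axes = plt.subplots(2, 2)
for a, cat in zip(axes.flat, categories):
    a.plot([0, 1, 2], [0, 1, 0], label=cat, color=colour_of[cat])

# Keep the first handle seen for each label, in order of appearance
by_label = {}
for a in axes.flat:
    handles, labels = a.get_legend_handles_labels()
    for handle, label in zip(handles, labels):
        by_label.setdefault(label, handle)

fig_legend = fig.legend(list(by_label.values()), list(by_label.keys()), loc="upper right")
legend_labels = [t.get_text() for t in fig_legend.get_texts()]
```

Because the dictionary is filled after the fact, the plotting loop does not need to know about the legend at all.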
I am plotting multiple dataframes as point plots using seaborn, all on the same axis.
How would I add a legend to the plot?
My code takes each of the dataframe and plots it one after another on the same figure.
Each dataframe has same columns
date        count
2017-01-01  35
2017-01-02  43
2017-01-03  12
2017-01-04  27
My code :
f, ax = plt.subplots(1, 1, figsize=figsize)
y_col = 'count'
This plots 3 lines on the same plot. However, the legend is missing. According to the documentation, pointplot does not accept a label argument.
One workaround that worked was creating a new dataframe and using hue argument.
df_1['region'] = 'A'
df_2['region'] = 'B'
df_3['region'] = 'C'
df = pd.concat([df_1,df_2,df_3])
But I would like to know if there is a way to create a legend for the code that first adds sequentially point plot to the figure and then add a legend.
Sample output :
I would suggest not using seaborn pointplot for plotting. It makes things unnecessarily complicated.
Instead, use matplotlib's plot_date. This lets you set labels on the plots and have them put into a legend automatically with ax.legend(). (Recent Matplotlib releases deprecate plot_date; plain ax.plot handles dates and takes the same label argument.)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
date = pd.date_range("2017-03", freq="M", periods=15)
count = np.random.rand(15,4)
df1 = pd.DataFrame({"date":date, "count" : count[:,0]})
df2 = pd.DataFrame({"date":date, "count" : count[:,1]+0.7})
df3 = pd.DataFrame({"date":date, "count" : count[:,2]+2})
f, ax = plt.subplots(1, 1)
y_col = 'count'
ax.plot_date(df1.date, df1["count"], color="blue", label="A", linestyle="-")
ax.plot_date(df2.date, df2["count"], color="red", label="B", linestyle="-")
ax.plot_date(df3.date, df3["count"], color="green", label="C", linestyle="-")
ax.legend()
plt.show()
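For reference, here is a small sketch of the same idea with plain ax.plot, which accepts date objects directly. The dates follow the sample in the question; the counts for B and C and the region names are invented for illustration.

```python
import datetime as dt
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

dates = [dt.date(2017, 1, 1) + dt.timedelta(days=i) for i in range(4)]
# Region A uses the counts from the question; B and C are made up
series = {
    "A": [35, 43, 12, 27],
    "B": [20, 31, 25, 40],
    "C": [50, 18, 33, 29],
}

fig, ax = plt.subplots()
for name, counts in series.items():
    # The label given here is what ax.legend() picks up automatically
    ax.plot(dates, counts, marker="o", linestyle="-", label=name)

legend = ax.legend()
fig.autofmt_xdate()  # rotate the date tick labels so they do not overlap
legend_labels = [t.get_text() for t in legend.get_texts()]
```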
In case one is still interested in obtaining the legend for pointplots, here is a way to go:
ax.legend(handles=ax.lines[::len(df1)+1], labels=["A","B","C"])
ax.set_xticklabels([t.get_text().split("T")[0] for t in ax.get_xticklabels()])
Old question, but there's an easier way.
plt.legend(labels=['legendEntry1', 'legendEntry2', 'legendEntry3'])
This lets you add the plots sequentially, and not have to worry about any of the matplotlib crap besides defining the legend items.
I tried using Adam B's answer, however, it didn't work for me. Instead, I found the following workaround for adding legends to pointplots.
import matplotlib.patches as mpatches
red_patch = mpatches.Patch(color='#bb3f3f', label='Label1')
black_patch = mpatches.Patch(color='#000000', label='Label2')
In the pointplots, the color can be specified as mentioned in previous answers. Once these patches corresponding to the different plots are set up,
plt.legend(handles=[red_patch, black_patch])
And the legend ought to appear in the pointplot.
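If the legend keys should look like the plotted points and lines instead of coloured boxes, Line2D proxies can be used in place of Patch. This is a sketch with plain matplotlib lines standing in for the pointplots; the series names and colours are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Placeholder series names and colours
series = {"Label1": "#bb3f3f", "Label2": "#000000"}

fig, ax = plt.subplots()
for offset, colour in enumerate(series.values()):
    # Stand-in for a pointplot call: drawn without any label
    ax.plot([0, 1, 2], [offset, offset + 1, offset], "o-", color=colour)

# One proxy artist per series; these are never drawn on the axes themselves
proxies = [
    Line2D([0], [0], color=colour, marker="o", linestyle="-", label=name)
    for name, colour in series.items()
]
legend = ax.legend(handles=proxies)
legend_labels = [t.get_text() for t in legend.get_texts()]
```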
This goes a bit beyond the original question, but it builds on @PSub's response to get to something more general. I do know some of this is easier in Matplotlib directly, but many of Seaborn's default styling options are quite nice, so I wanted to work out how you could have more than one legend for a point plot (or other Seaborn plot) without dropping into Matplotlib right at the start.
Here's one solution:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# We will need to access some of these matplotlib classes directly
from matplotlib.lines import Line2D # For points and lines
from matplotlib.patches import Patch # For KDE and other plots
from matplotlib.legend import Legend
from matplotlib import cm
# Initialise random number generator
rng = np.random.default_rng(seed=42)
# Generate sample of 25 numbers
n = 25
clusters = []
for c in range(0,3):
    # Crude way to get different distributions
    # for each cluster
    p = rng.integers(low=1, high=6, size=4)
    df = pd.DataFrame({
        'x': rng.normal(p[0], p[1], n),
        'y': rng.normal(p[2], p[3], n),
        'name': f"Cluster {c+1}"
    })
    clusters.append(df)
# Flatten to a single data frame
clusters = pd.concat(clusters)
# Now do the same for data to feed into
# the second (scatter) plot...
n = 8
points = []
for c in range(0,2):
    p = rng.integers(low=1, high=6, size=4)
    df = pd.DataFrame({
        'x': rng.normal(p[0], p[1], n),
        'y': rng.normal(p[2], p[3], n),
        'name': f"Group {c+1}"
    })
    points.append(df)
points = pd.concat(points)
# And create the figure
f, ax = plt.subplots(figsize=(8,8))
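# The KDE-plot generates a Legend 'as usual'
kde_alpha = 0.2  # assumed value; the original setting is not shown
k = sns.kdeplot(
    data=clusters,
    x='x', y='y',
    hue='name', fill=True, alpha=kde_alpha, ax=ax,
)
# Notice that we access this legend via the
# axis to turn off the frame, set the title,
# and adjust the patch alpha level so that
# it closely matches the alpha of the KDE-plot
ax.get_legend().set_frame_on(False)
ax.get_legend().set_title('Clusters')  # assumed title
for lh in ax.get_legend().get_patches():
    lh.set_alpha(kde_alpha)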
# You would probably want to sort your data
# frame or set the hue and style order in order
# to ensure consistency for your own application
# but this works for demonstration purposes
groups = points.name.unique()
markers = ['o', 'v', 's', 'X', 'D', '<', '>']
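# cm.get_cmap is no longer available in recent Matplotlib releases; plt.get_cmap does the same lookup
colors = plt.get_cmap('Dark2').colors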
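# Generate the scatterplot: notice that Legend is
# off (otherwise this legend would overwrite the
# first one) and that we're setting the hue, style,
# markers, and palette using the 'name' parameter
# from the data frame and the number of groups in
# the data.
p = sns.scatterplot(
    data=points, x='x', y='y',
    hue='name', style='name',
    markers=markers[:len(groups)],
    palette=list(colors[:len(groups)]),
    legend=False, ax=ax,
)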
# Here's the 'magic' -- we use zip to link together
# the group name, the color, and the marker style. You
# *cannot* retrieve the marker style from the scatterplot
# since that information is lost when rendered as a
# PathCollection (as far as I can tell). Anyway, this allows
# us to loop over each group in the second data frame and
# generate a 'fake' Line2D plot (with zero elements and no
# line-width in our case) that we can add to the legend. If
# you were overlaying a line plot or a second plot that uses
# patches you'd have to tweak this accordingly.
patches = []
for x in zip(groups, colors[:len(groups)], markers[:len(groups)]):
    patches.append(Line2D([0],[0], linewidth=0.0, linestyle='',
                          color=x[1], markerfacecolor=x[1],
                          marker=x[2], label=x[0], alpha=1.0))
# And add these patches (with their group labels) to the new
# legend item and place it on the plot.
leg = Legend(ax, patches, labels=groups,
             loc='upper left', frameon=False, title='Groups')
ax.add_artist(leg)
# Done
Here's the output:
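Stripped of the seaborn parts, the two-legend mechanism comes down to keeping the first legend alive with add_artist before creating the second. This is a minimal matplotlib-only sketch with invented lines and group names, not the answer's exact figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

fig, ax = plt.subplots()
l1, = ax.plot([0, 1], [0, 1], color="C0", label="Cluster 1")
l2, = ax.plot([0, 1], [1, 0], color="C1", label="Cluster 2")

# First legend: built from the real artists
first = ax.legend(handles=[l1, l2], loc="upper right", title="Clusters", frameon=False)
# Re-add it as a plain artist; otherwise the next ax.legend call replaces it
ax.add_artist(first)

# Second legend: built from marker-only proxy artists (invented group names)
group_markers = {"Group 1": "o", "Group 2": "v"}
proxies = [
    Line2D([0], [0], linestyle="", marker=m, color="black", label=g)
    for g, m in group_markers.items()
]
second = ax.legend(handles=proxies, loc="upper left", title="Groups", frameon=False)

legend_titles = [first.get_title().get_text(), second.get_title().get_text()]
```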
I'm very new to python but am interested in learning a new technique whereby I can identify different data points in a scatter plot with different markers according to where they fall in the scatter plot.
My specific example is much like this: http://www.astroml.org/examples/datasets/plot_sdss_line_ratios.html
I have a BPT plot and want to split the data along the demarcation line.
I have a data set in this format:
data = [[a,b,c],
And I also have the following for the demarcation line:
NII = np.linspace(-3.0, 0.35)
def log_OIII_Hb_NII(log_NII_Ha, eps=0):
    return 1.19 + eps + 0.61 / (log_NII_Ha - eps - 0.47)
Any help would be great!
There was not enough room in the comments section. Not too dissimilar to what @DrV wrote, but maybe more astronomically inclined:
import random
import numpy as np
import matplotlib.pyplot as plt
def log_OIII_Hb_NII(log_NII_Ha, eps=0):
    return 1.19 + eps + 0.61 / (log_NII_Ha - eps - 0.47)
# Make some fake measured NII_Ha data
iternum = 100
# Ranged -2.1 to 0.4:
Measured_NII_Ha = np.array([random.random()*2.5-2.1 for i in range(iternum)])
# Ranged -1.5 to 1.5:
Measured_OIII_Hb = np.array([random.random()*3-1.5 for i in range(iternum)])
# For our measured x-value, what is our cut-off value
Measured_Predicted_OIII_Hb = log_OIII_Hb_NII(Measured_NII_Ha)
# Now compare the cut-off line to the measured emission line fluxes
# by using numpy True/False arrays
# i.e., x = numpy.array([1,2,3,4])
# >> index = x >= 3
# >> print(index)
# >> numpy.array([False, False, True, True])
# >> print(x[index])
# >> numpy.array([3,4])
Above_Predicted_Red_Index = Measured_OIII_Hb > Measured_Predicted_OIII_Hb
Below_Predicted_Blue_Index = Measured_OIII_Hb < Measured_Predicted_OIII_Hb
# Alternatively, you can invert Above_Predicted_Red_Index
# Make the cut-off line for a range of values for plotting it as
# a continuous line
Predicted_NII_Ha = np.linspace(-3.0, 0.35)
Predicted_log_OIII_Hb_NII = log_OIII_Hb_NII(Predicted_NII_Ha)
fig = plt.figure(0)
ax = fig.add_subplot(111)
# Plot the modelled cut-off line
ax.plot(Predicted_NII_Ha, Predicted_log_OIII_Hb_NII, color="black", lw=2)
# Plot the data for a given colour
ax.errorbar(Measured_NII_Ha[Above_Predicted_Red_Index], Measured_OIII_Hb[Above_Predicted_Red_Index], fmt="o", color="red")
ax.errorbar(Measured_NII_Ha[Below_Predicted_Blue_Index], Measured_OIII_Hb[Below_Predicted_Blue_Index], fmt="o", color="blue")
# Make it aesthetically pleasing
ax.set_ylabel(r"$\rm \log([OIII]/H\beta)$")
ax.set_xlabel(r"$\rm \log([NII]/H\alpha)$")
I assume you have the pixel coordinates as a, b in your example. The column with cs is then something that is used to calculate whether a point belongs to one of the two groups.
Make your data first an ndarray:
import numpy as np
data = np.array(data)
Now you may create two arrays by checking which part of the data belongs to which area:
dataselector = log_OIII_Hb_NII(data[:,2]) > 0
This creates a vector of Trues and Falses which has a True whenever the data in the third column (column 2) gives a positive value from the function. The length of the vector equals the number of rows in data.
Then you can plot the two data sets:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
# the plotting part
ax.plot(data[dataselector,0], data[dataselector,1], 'ro')
ax.plot(data[~dataselector,0], data[~dataselector,1], 'bo')
Create a list of True/False values which tells which rows of data belong to which group.
Plot the two groups (~dataselector means "all the rows where there is a False in dataselector"; recent NumPy versions reject -dataselector on boolean arrays).
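To check the masking logic without any plotting, here is a tiny worked sketch. It assumes column 0 holds log([NII]/Ha) and column 1 holds log([OIII]/Hb), which is my guess at the layout, and it uses four hand-picked points instead of real measurements.

```python
import numpy as np


def log_OIII_Hb_NII(log_NII_Ha, eps=0):
    # Demarcation line from the question
    return 1.19 + eps + 0.61 / (log_NII_Ha - eps - 0.47)


# Hand-picked points; columns assumed to be [log_NII_Ha, log_OIII_Hb]
data = np.array([
    [-1.0,  1.0],
    [-1.0, -1.0],
    [ 0.0,  0.5],
    [ 0.0, -0.5],
])
x, y = data[:, 0], data[:, 1]

# True where the measured y lies above the line evaluated at the measured x
above = y > log_OIII_Hb_NII(x)

points_above = data[above]    # rows to draw in one colour
points_below = data[~above]   # ~ inverts the boolean mask
```

Every row lands in exactly one of the two groups, so the two plot calls together cover the whole data set.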