Creating a phylogenetic tree with domain annotations using BioPython


I want to create a figure like this:
(Image: example of the figure I would like to create)
Here is some dummy data and my attempt so far:
import io
import matplotlib.pyplot as plt
from Bio import Phylo
# input data
treedata = "(A, (B, C))"
handle = io.StringIO(treedata)
tree = Phylo.read(handle, "newick")
# domains = [[species reference, full length of protein sequence,
#             [domain reference code, start position, end position], ...], ...]
domains = [['A', 150, ['IPR000001', 10, 15], ['IPR000002', 20, 40], ['IPR000003', 70, 130]],
['B', 300, ['IPR000001', 70, 150], ['IPR000002', 29, 40], ['IPR000003', 100, 200]],
['C', 100, ['IPR000001', 5, 15], ['IPR000002', 25, 30], ['IPR000003', 27, 90]]]
# create figure and subplots
fig = plt.figure(figsize=(6, 6), dpi=300)
ax1 = fig.add_subplot(1, 2, 1) # left axis
ax2 = fig.add_subplot(1, 2, 2, sharey=ax1) # right axis
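# draw dendrogram to axis 1; Phylo.draw returns None, and do_show=False
# stops it from calling plt.show() before the second axis is filled
Phylo.draw(tree, axes=ax1, do_show=False)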
# draw rest to axis 2
# ...
# show figure
plt.show()
I have been advised to use the matplotlib bar function to plot the domains. How would I go about doing this?
P.S. If there is a much easier way of doing this in another language, I am open to it, but I would prefer to do this programmatically if possible.

You could use ETE3 for this as well. It can load the tree from a Newick string, and you can then attach the motifs to it. As far as I understand the documentation, you need a list of lists for each organism, like so:
motifs = [[start_of_motif, end_of_motif, motif_shape, motif_width, motif_height,
foreground_color, font|size|color|label_text],
[start_of_motif2, end_of_motif2, motif2_shape, motif2_width, motif2_height,
foreground_color, font|size|color|label2_text]]
and so on.
So for example you could have this as
motifs_a = [[10, 15, "[]", None, 10, "green", "arial|12|black|IPR000001"],
[20, 40, "[]", None, 10, "yellow", "arial|12|black|IPR000002"],
[70, 130, "[]", None, 10, "red", "arial|12|black|IPR000003"]]
for your first organism, where [] for the shape means it'll be a rectangle.
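If the domain coordinates already live in a nested list like the domains variable from the question, the per-organism motif lists can be generated rather than typed out. A minimal sketch, assuming the same seven-field layout as above; the helper name to_motifs and the colour mapping are hypothetical choices for illustration, not part of ETE3:

```python
# Data in the layout used in the question: [name, protein length, [code, start, end], ...]
domains = [['A', 150, ['IPR000001', 10, 15], ['IPR000002', 20, 40], ['IPR000003', 70, 130]],
           ['B', 300, ['IPR000001', 70, 150], ['IPR000002', 29, 40], ['IPR000003', 100, 200]],
           ['C', 100, ['IPR000001', 5, 15], ['IPR000002', 25, 30], ['IPR000003', 27, 90]]]

# Hypothetical colour per domain code (an assumption, pick whatever suits the figure)
colors = {'IPR000001': 'green', 'IPR000002': 'yellow', 'IPR000003': 'red'}


def to_motifs(entry, colors, height=10):
    """Build one motif list (rectangles, default width) from one `domains` entry."""
    _name, _length, *spans = entry
    return [[start, end, "[]", None, height, colors[code], f"arial|12|black|{code}"]
            for code, start, end in spans]


# One motif list per organism, keyed by leaf name
motifs_by_name = {entry[0]: to_motifs(entry, colors) for entry in domains}
```

Here motifs_by_name['A'] comes out as the same list as motifs_a above, and the other entries can be attached to leaves B and C in the same way.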
You then attach it to the relevant organism. Going off ETE3's documentation, that would be:
from ete3 import Tree, SeqMotifFace, add_face_to_node
tree_with_domains = Tree("(A, (B, C))") # or Tree("path/to/newick.nwk")
protein_seq_a = "<your sequence here>"
motifs_a = [[10, 15, "[]", None, 10, "green", "arial|12|black|IPR000001"],
[20, 40, "[]", None, 10, "yellow", "arial|12|black|IPR000002"],
[70, 130, "[]", None, 10, "red", "arial|12|black|IPR000003"]]
organism_a_motif_face = SeqMotifFace(protein_seq_a, motifs=motifs_a)
(tree_with_domains & "A").add_face(organism_a_motif_face, 0, "aligned")
If you don't have the sequence, you can also pass seq=None to SeqMotifFace.
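For the matplotlib route asked about in the question, one possible approach is Axes.barh with its left argument: one thin grey bar for the full protein and one thicker bar per domain. The following is only a sketch. The helper names domain_rows and draw_domains are made up for illustration, and it assumes (worth checking against your Biopython version) that Phylo.draw places the tips at y = 1, 2, 3, ... in the order returned by tree.get_terminals().

```python
def domain_rows(domains, tip_order):
    """Return (y, protein_length, [(code, start, width), ...]) for each tip name."""
    by_name = {entry[0]: entry for entry in domains}
    rows = []
    for y, name in enumerate(tip_order, start=1):  # assumed y position of each tip
        _name, length, *spans = by_name[name]
        rows.append((y, length, [(code, start, end - start) for code, start, end in spans]))
    return rows


def draw_domains(ax, rows, colors):
    """Draw the rows onto a matplotlib Axes using horizontal bars."""
    for y, length, spans in rows:
        # thin backbone showing the full length of the protein
        ax.barh(y, length, left=0, height=0.1, color="lightgrey")
        for code, start, width in spans:
            # one bar per domain, starting at `start` and `width` residues long
            ax.barh(y, width, left=start, height=0.4, color=colors[code])


# Dummy data from the question
domains = [['A', 150, ['IPR000001', 10, 15], ['IPR000002', 20, 40], ['IPR000003', 70, 130]],
           ['B', 300, ['IPR000001', 70, 150], ['IPR000002', 29, 40], ['IPR000003', 100, 200]],
           ['C', 100, ['IPR000001', 5, 15], ['IPR000002', 25, 30], ['IPR000003', 27, 90]]]
rows = domain_rows(domains, ['A', 'B', 'C'])
```

With the figure from the question, draw_domains(ax2, domain_rows(domains, [t.name for t in tree.get_terminals()]), colors) would draw onto the right-hand axis, where colors maps each domain code to a colour. Because ax2 shares its y axis with ax1, the bars should line up with the tips if the assumption about y positions holds. Overlapping domains (such as IPR000002 and IPR000003 in species C) are simply drawn on top of each other.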

Related

Plot a line on a curve that is undersampled

I was wondering if there is a way to color part of a line so that it follows the curve between two user-specified x values. For example, the user wants to color the line from x = 11 to x = 14 (see the image below for the desired result). I tried df.loc[...] to locate the closest points, but then it just colors the line from x = 10 to x = 15. Does anyone have an idea how to solve this? Do I need to add extra points between two existing points, and if so, how would I do that? The user might also ask for x = 11 to x = 19.
Appreciate any help or guidance.
from bokeh.plotting import figure, output_file, show
import pandas as pd
p = figure(width=600, height=600, tools="pan,reset,save")
data = {'x': [1, 2, 3, 6, 10, 15, 20, 22],
'y': [2, 3, 6, 8, 18, 24, 50, 77]}
df = pd.DataFrame(data)
p.line(df.x, df.y)
show(p)
What the result should look like when user inputs x = 11 (start) and x = 14 (end):

With pandas you can create an interpolated DataFrame from the original.
With this you can add a new line in red.
from bokeh.plotting import figure, output_notebook, show
import pandas as pd
output_notebook()
p = figure(width=600, height=600, tools="pan,reset,save")
data = {'x': [1, 2, 3, 6, 10, 15, 20, 22],
'y': [2, 3, 6, 8, 18, 24, 50, 77]}
df = pd.DataFrame(data)
df_interpolated = (df.copy()
    .set_index('x')
    .reindex(index = range(df['x'].min(), df['x'].max() + 1)) # + 1 so the last x value is kept
    .reset_index() # turn x back into a column
    .interpolate() # fill the new rows by linear interpolation
)
# select the rows by x value rather than by position
segment = df_interpolated[df_interpolated.x.between(11, 14)]
p.line(df.x, df.y)
p.line(segment.x, segment.y, color="red")
show(p)
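Reindexing onto every integer x works when the requested endpoints are integers. If they might not be (say x = 11.5), another option is to interpolate only the two endpoints and keep the original points that fall between them. A minimal plain-Python sketch of that idea, with hypothetical helper names interpolate and highlight_segment; it assumes the x values are sorted and that both endpoints lie within the data range:

```python
from bisect import bisect_left

xs = [1, 2, 3, 6, 10, 15, 20, 22]
ys = [2, 3, 6, 8, 18, 24, 50, 77]


def interpolate(x, xs, ys):
    """Linearly interpolate y at x between the two neighbouring data points."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the range of the data")
    i = bisect_left(xs, x)  # index of the first data point with xs[i] >= x
    if xs[i] == x:
        return ys[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def highlight_segment(xs, ys, start, end):
    """Points of the curve between start and end, including both endpoints."""
    inner = [(x, y) for x, y in zip(xs, ys) if start < x < end]
    return [(start, interpolate(start, xs, ys)), *inner, (end, interpolate(end, xs, ys))]


# The second case from the question: x = 11 to x = 19 crosses the data point at x = 15
segment = highlight_segment(xs, ys, 11, 19)
seg_x, seg_y = zip(*segment)
```

Passing seg_x and seg_y to p.line(..., color="red") then draws the highlighted part on top of the original line, including the kink at x = 15.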

Avoid duplicate labels in matplotlib with x, y of form [[...], [...] ...]

First, let me point to a similar post that differs slightly from my case.
I am having trouble creating a legend with unique labels for input data of the form:
idl_t, idl_q = [[0, 12, 20], [8, 14, 24]], [[90, 60, 90], [90, 60, 90]]
and plotting as follows:
plt.plot(idl_t, idl_q, label="Some label")
The result is that I get multiple legend entries with the same text. The linked post had a similar problem, but the OP there was using data of the form:
idl_t, idl_q = [1,2], [2,3]
which is different from my case, and I am not sure whether the logic there can be applied to my case.
So the question is: how do I avoid duplicate labels without changing the input data?

You can get the handles and labels used to make the legend and modify them. In the code below, they are put into a dictionary; since dictionary keys are unique (the keys here being your labels), duplicate labels are dropped. (You may want to manipulate them differently to achieve your goal.)
import matplotlib.pyplot as plt
idl_t, idl_q = [[0, 12, 20], [8, 14, 24]], [[90, 60, 90], [90, 60, 90]]
plt.plot(idl_t, idl_q, label="Some label")
# get legend handles and their corresponding labels
handles, labels = plt.gca().get_legend_handles_labels()
# zip labels as keys and handles as values into a dictionary, ...
# so only unique labels would be stored
dict_of_labels = dict(zip(labels, handles))
# use unique labels (dict_of_labels.keys()) to generate your legend
plt.legend(dict_of_labels.values(), dict_of_labels.keys())
plt.show()
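One detail of the dict(zip(labels, handles)) trick: when a label occurs more than once, the dictionary keeps the position of its first occurrence but the handle of its last. That rarely matters when all duplicates look alike, but if the first handle is wanted, a small loop does it. A sketch with placeholder strings standing in for the real handle objects:

```python
labels = ["Some label", "Other label", "Some label"]
handles = ["line0", "line1", "line2"]  # placeholders for matplotlib line handles

# dict(zip(...)) keeps the LAST handle seen for each label
last_wins = dict(zip(labels, handles))

# setdefault only stores a handle the first time a label is seen
first_wins = {}
for label, handle in zip(labels, handles):
    first_wins.setdefault(label, handle)

print(last_wins)   # {'Some label': 'line2', 'Other label': 'line1'}
print(first_wins)  # {'Some label': 'line0', 'Other label': 'line1'}
```

Either dictionary can be passed to plt.legend in the same way, as values() and keys().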

It can be done with a one-line loop body, but I am afraid it does not read that well.
import matplotlib.pyplot as plt
idl_t, idl_q = [[0, 12, 20], [8, 14, 24]], [[90, 60, 90], [90, 60, 90]]
fig, ax = plt.subplots()
# labels that start with an underscore are left out of the legend
for i, (x, y) in enumerate(zip(zip(*idl_t), zip(*idl_q))):
    ax.plot(x, y, label="Label" if i == 0 else "_nolabel_")
Maybe something like this is simpler to understand:
# loop over the columns, which is how plt.plot splits 2D input into lines
for i in range(len(idl_t[0])):
    ax.plot(
        [xval[i] for xval in idl_t],
        [yval[i] for yval in idl_q],
        label="Label" if i == 0 else "_nolabel_"
    )
ax.legend()
In case you want to plot [0, 12, 20] with [90, 60, 90], and [8, 14, 24] with [90, 60, 90]:
for i, (x, y) in enumerate(zip(idl_t, idl_q)):
    ax.plot(x, y, label="Label" if i == 0 else "_nolabel_")
ax.legend()
plt.show()

Creating a heatmap with uneven block sizes / stacked bar chart using Python

I want to create a heatmap in Python that is similar to what is shown on the bottom of this screenshot from TomTom Move: https://d2altcye8lkl9f.cloudfront.net/2021/03/image-1.png (source: https://support.move.tomtom.com/ts-output-route-analysis/)
A route contains multiple segments that vary in length. Each segment consists of the average speed which I want to color using the colormap (green for fast speed to yellow to red for slow speed). I was able to plot each segment in their correct order using a stacked histplot, but when I add hue, it orders the segments with the fastest average speeds first to slowest, and not the segments in their correct order.
There are three time sets containing 4 segments with their length, length of the route so far and speed for each segment for each time set.
import pandas as pd
d = {'timeRanges': ['00:00-06:30', '00:00-06:30', '00:00-06:30', '00:00-06:30', '07:00-15:00', '07:00-15:00', '07:00-15:00', '07:00-15:00', '16:00-17:30', '16:00-17:30', '16:00-17:30', '16:00-17:30'], 'segmentOrder': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], 'segmentDistance': [20, 45, 60, 30, 20, 45, 60, 30, 20, 45, 60, 30], 'distanceAlongRoute': [20, 65, 125, 155, 20, 65, 125, 155, 20, 65, 125, 155], 'averageSpeed': [54.2, 48.1, 23.5, 33.7, 56.2, 53.2, 42.5, 44.2, 50.2, 46.2, 35.3, 33.2]}
df = pd.DataFrame(data=d)
I have tried using seaborn heatmap and imshow and I have yet to make the x axis block widths vary for each segment.
Much appreciated.

Here is a simple example of a heatmap with different box sizes. Based on the example "Heatmap with Unequal Block Sizes" https://plotly.com/python/heatmaps/. Just set the xe variable to all of the x-axis edges and z to the values that will be used for determining the colors between those points. There should be 1 fewer z value than xe value.
import plotly.graph_objects as go
import numpy as np
xe = [0, 1, 2, 5, 6]
ye = [0, 1]
z = [[1, 2, 1, 3]]
fig = go.Figure(data=go.Heatmap(
x = np.sort(xe),
y = np.sort(ye),
z = z,
type = 'heatmap',
colorscale = 'Viridis'))
fig.update_layout(margin = dict(t=200,r=200,b=200,l=200),
showlegend = False,
width = 700, height = 500,
autosize = False
)
fig.show()
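To apply this to the data in the question, the x edges can be derived from the segment lengths and the z rows from the speeds of each time range. A minimal sketch in plain Python, assuming every time range lists the same segments in the same order (as in the sample data); the variable names are illustrative:

```python
from itertools import accumulate

time_ranges = ['00:00-06:30', '07:00-15:00', '16:00-17:30']
segment_distance = [20, 45, 60, 30]        # same for every time range
average_speed = [54.2, 48.1, 23.5, 33.7,
                 56.2, 53.2, 42.5, 44.2,
                 50.2, 46.2, 35.3, 33.2]   # 4 segments for each of 3 time ranges

# Block edges along the route: 0 plus the running total of the segment lengths
xe = [0, *accumulate(segment_distance)]

# One row of z per time range, one value per segment
n = len(segment_distance)
z = [average_speed[i:i + n] for i in range(0, len(average_speed), n)]

print(xe)  # [0, 20, 65, 125, 155]
```

These would then go into go.Heatmap(x=xe, z=z, ...), with a red-to-green colorscale such as 'RdYlGn' so that slow segments come out red and fast ones green. Passing time_ranges as y to label the rows is an assumption worth testing rather than something verified here.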

Matplotlib how to dotplot variable number of points over time?

I'm trying to build an audio fingerprinting algorithm like Shazam.
I have a variable length array of frequency point data like so:
[[69, 90, 172],
[6, 18, 24],
[6, 18],
[6, 18, 24, 42],
[]
...
I would like to plot it as dots, like a spectrogram, sort of like this. My data doesn't have an explicit time axis, but each row is a 0.1 s slice of time. I am aware of plt.specgram.

np.repeat can create the accompanying array of x values. It needs an array of sizes (the number of points in each row), calculated from the input values.
Here is an example supposing the x values are 0.1 apart (as in the post's description, but unlike the example image).
import numpy as np
import matplotlib.pyplot as plt
# ys = [[69, 90, 172], [6, 18, 24], [6, 18], [6, 18, 24, 42]]
ys = [np.random.randint(50, 3500, np.random.randint(2, 6)) for _ in range(30)]
sizes = [len(y) for y in ys]
# one x per point; integer steps divided by 10 avoid float rounding in arange
xs = np.repeat(np.arange(1, len(ys) + 1) / 10, sizes)
plt.scatter(xs, np.concatenate(ys), marker='x', color='blueviolet')
plt.show()
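The same pairing of each value with its time slice can be written without numpy, which also makes it clear what happens to empty rows: they simply contribute no points. A sketch, assuming as above that row i (counting from 1) belongs to time i * 0.1 s; the function name is hypothetical:

```python
def ragged_to_points(rows, dt=0.1):
    """Flatten rows of frequencies into parallel lists of times and frequencies."""
    xs, ys = [], []
    for i, row in enumerate(rows, start=1):
        for freq in row:
            xs.append(i * dt)  # every value in a row shares that row's time
            ys.append(freq)
    return xs, ys


# The rows shown in the question, including the empty one
rows = [[69, 90, 172], [6, 18, 24], [6, 18], [6, 18, 24, 42], []]
xs, ys = ragged_to_points(rows)
```

The two lists can be handed to plt.scatter(xs, ys, marker='x') exactly like the numpy arrays.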

Create a 100 % stacked area chart with matplotlib

I was wondering how to create a 100 % stacked area chart in matplotlib. At the matplotlib page I couldn't find an example for it.
Somebody here can show me how to achieve that?

A simple way to achieve this is to make sure that for every x-value, the y-values sum to 100.
I assume that you have the y-values organized in an array as in the example below, i.e.
y = np.array([[17, 19, 5, 16, 22, 20, 9, 31, 39, 8],
[46, 18, 37, 27, 29, 6, 5, 23, 22, 5],
[15, 46, 33, 36, 11, 13, 39, 17, 49, 17]])
To make sure the column totals are 100, you have to divide the y array by its column sums, and then multiply by 100. This makes the y-values span from 0 to 100, making the "unit" of the y-axis percent. If you instead want the values of the y-axis to span the interval from 0 to 1, don't multiply by 100.
Even if you don't have the y-values organized in one array as above, the principle is the same; the corresponding elements in each array consisting of y-values (e.g. y1, y2 etc.) should sum to 100 (or 1).
The code below is a modified version of the example @LogicalKnight linked to in his comment.
import numpy as np
from matplotlib import pyplot as plt
fnx = lambda : np.random.randint(5, 50, 10)
y = np.vstack((fnx(), fnx(), fnx())) # np.row_stack is a deprecated alias of np.vstack
x = np.arange(10)
# Make new array consisting of fractions of column-totals;
# .astype(float) avoids integer division on Python 2 and is harmless on Python 3
percent = y / y.sum(axis=0).astype(float) * 100
fig = plt.figure()
ax = fig.add_subplot(111)
ax.stackplot(x, percent)
ax.set_title('100 % stacked area chart')
ax.set_ylabel('Percent (%)')
ax.margins(0, 0) # Set margins to avoid "whitespace"
plt.show()
This gives the output shown below.
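One case the division does not cover is an x-value where every series is zero, since the column total is then zero as well. A plain-Python sketch of the same normalisation that leaves such columns at zero (that choice is an assumption; you may prefer to drop those x-values instead):

```python
y = [[17, 19, 5, 16, 22, 20, 9, 31, 39, 8],
     [46, 18, 37, 27, 29, 6, 5, 23, 22, 5],
     [15, 46, 33, 36, 11, 13, 39, 17, 49, 17]]


def to_percent(rows):
    """Scale each column so that it sums to 100 (all-zero columns stay at 0)."""
    totals = [sum(column) for column in zip(*rows)]
    return [[100 * value / total if total else 0.0
             for value, total in zip(row, totals)]
            for row in rows]


percent = to_percent(y)
column_sums = [sum(column) for column in zip(*percent)]  # each is 100 up to rounding
```

The nested list can be passed to ax.stackplot(x, percent) in place of the numpy array.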
