Create a Dendrogram from Genome - python

I wanted to play around with genomic data:
Species_A = ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
Species_B = ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
Species_C = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
Species_D = ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
Species_E = ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
I wanted to create a dendrogram based on how closely these organisms are related to each other, given the genome sequences above. What I did first was to count the number of a's, c's, t's and g's for each species; then I created an array and plotted a dendrogram:
gen_size1 = len(Species_A)
a1 = float(Species_A.count('a'))/float(gen_size1)
c1 = float(Species_A.count('c'))/float(gen_size1)
g1 = float(Species_A.count('g'))/float(gen_size1)
t1 = float(Species_A.count('t'))/float(gen_size1)
.
.
.
gen_size5 = len(Species_E)
a5 = float(Species_E.count('a'))/float(gen_size5)
c5 = float(Species_E.count('c'))/float(gen_size5)
g5 = float(Species_E.count('g'))/float(gen_size5)
t5 = float(Species_E.count('t'))/float(gen_size5)
my_genes = np.array([[a1,c1,g1,t1],[a2,c2,g2,t2],[a3,c3,g3,t3],[a4,c4,g4,t4],[a5,c5,g5,t5]])
plt.subplot(1,2,1)
plt.title("Mononucleotide")
linkage_matrix = linkage(my_genes, "single")
print(linkage_matrix)
dendrogram(linkage_matrix,truncate_mode='lastp', color_threshold=1, labels=[Species_A, Species_B, Species_C, Species_D, Species_E], show_leaf_counts=True)
plt.show()
Species A and B are variants of the same organism, and I am expecting both to diverge from a common clade off the root; the same goes for Species C and D, which should diverge from another common clade off the root, with Species E diverging from the main root because it is not related to Species A to D. Unfortunately, the dendrogram came out mixed up, with Species A and E diverging from a common clade and Species C, D and B in another clade (pretty messed up).
I have read about hierarchical clustering for genome sequences, but I have observed that it only accommodates a 2-dimensional system; unfortunately I have 4 dimensions, which are a, c, t and g. Is there any other strategy for this? Thanks for the help!

This is a fairly common problem in bioinformatics, so you should use a bioinformatics library like Biopython that has this functionality built in.
First you create a multi-FASTA file with your sequences:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
sequences = ['ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
             'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
             'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
             'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
             'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag']
my_records = [SeqRecord(Seq(sequence, generic_dna),
                        id='Species_{}'.format(letter),
                        description='Species_{}'.format(letter))
              for sequence, letter in zip(sequences, 'ABCDE')]
root_dir = r"C:\Users\BioGeek\Documents\temp"
filename = 'my_sequences'
fasta_path = os.path.join(root_dir, '{}.fasta'.format(filename))
SeqIO.write(my_records, fasta_path, "fasta")
This creates the file C:\Users\BioGeek\Documents\temp\my_sequences.fasta that looks like this:
>Species_A
ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag
>Species_B
ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag
>Species_C
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag
>Species_D
ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg
>Species_E
ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag
Next, use the command line tool ClustalW to do a multiple sequence alignment:
from Bio.Align.Applications import ClustalwCommandline
clustalw_exe = r"C:\path\to\clustalw-2.1\clustalw2.exe"
assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
clustalw_cline = ClustalwCommandline(clustalw_exe, infile=fasta_path)
stdout, stderr = clustalw_cline()
print(stdout)
This prints:
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: Species_A 50 bp
Sequence 2: Species_B 50 bp
Sequence 3: Species_C 50 bp
Sequence 4: Species_D 50 bp
Sequence 5: Species_E 50 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 90
Sequences (1:3) Aligned. Score: 94
Sequences (1:4) Aligned. Score: 88
Sequences (1:5) Aligned. Score: 84
Sequences (2:3) Aligned. Score: 90
Sequences (2:4) Aligned. Score: 84
Sequences (2:5) Aligned. Score: 78
Sequences (3:4) Aligned. Score: 94
Sequences (3:5) Aligned. Score: 82
Sequences (4:5) Aligned. Score: 82
Guide tree file created: [C:\Users\BioGeek\Documents\temp\my_sequences.dnd]
There are 4 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:912
Group 2: Sequences: 2 Score:921
Group 3: Sequences: 4 Score:865
Group 4: Sequences: 5 Score:855
Alignment Score 2975
CLUSTAL-Alignment file created [C:\Users\BioGeek\Documents\temp\my_sequences.aln]
The my_sequences.dnd file that ClustalW creates is a standard Newick tree file, which Bio.Phylo can parse:
from Bio import Phylo
newick_path = os.path.join(root_dir, '{}.dnd'.format(filename))
tree = Phylo.read(newick_path, "newick")
Phylo.draw_ascii(tree)
Which prints:
       ____________ Species_A
  ____|
 |    |_____________________________________ Species_B
 |
_|          ____ Species_C
 |_________|
 |          |_________________________ Species_D
 |
 |__________________________________________________________________ Species_E
Or, if you have matplotlib or pylab installed, you can create a graphic using the draw function:
tree.rooted = True
Phylo.draw(tree, branch_labels=lambda c: c.branch_length)
which produces a matplotlib plot of the rooted tree.
This dendrogram clearly illustrates what you expected: Species A and B are variants of the same organism and diverge from a common clade off the root; the same goes for Species C and D, which diverge from another common clade off the root. Finally, Species E diverges from the main root because it is not related to Species A to D.

Well, using SciPy you could use a custom distance (my bet would be to start from Needleman-Wunsch or Smith-Waterman alignment scores). Here is an example of how to play with your input data; you should also check how to define a custom distance in SciPy. Once you have that set up, you can use a more advanced alignment approach like MAFFT, extract the relationships between the genomes, and use them when you create your dendrogram.
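As an illustration only (a sketch, not the exact pipeline the paragraph above has in mind): the code below builds a pairwise distance matrix using difflib's ratio as a crude stand-in for a real alignment score (Needleman-Wunsch / Smith-Waterman) and hands it to SciPy's linkage, so the dendrogram is driven by whole-sequence similarity rather than base composition:
from difflib import SequenceMatcher

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

species = {
    'Species_A': 'ctnngtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag',
    'Species_B': 'ctaagtggactgacaggaactgtttcgaatcggaagcttgcttaacgtag',
    'Species_C': 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgtag',
    'Species_D': 'ctacgtggaccgacaagaacagtttcgactcggaagcttgcttaacgccg',
    'Species_E': 'ctgtgtggancgacaaggacagttccaaatcggaagcttgcttaacacag',
}
names = list(species)
seqs = list(species.values())

# Pairwise distance = 1 - similarity; swap the SequenceMatcher ratio for a
# proper alignment-based score if you have one available.
n = len(seqs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - SequenceMatcher(None, seqs[i], seqs[j]).ratio()
        dist[i, j] = dist[j, i] = d

# linkage() expects a condensed distance matrix, hence squareform()
linkage_matrix = linkage(squareform(dist), method='average')
dendrogram(linkage_matrix, labels=names)
plt.show()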

How do I get the probability of a string being similar to another string in Python?
I want to get a decimal value like 0.9 (meaning 90%) etc., preferably with standard Python and its library.
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
There is a built-in.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
Solution #1: Python built-in
Use SequenceMatcher from difflib.
Pros: native Python library, no extra package needed.
Cons: too limited; there are many other good algorithms for string similarity out there.
example :
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
Solution #2: jellyfish library
It's a very good library with good coverage and few issues.
It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance
Pros: easy to use, a gamut of supported algorithms, tested.
Cons: not a native library.
example:
>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to (a small worked example follows the list):
Hamming distance
Levenshtein distance
Damerau–Levenshtein distance
Jaro–Winkler distance
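For instance, the Hamming distance in that list simply counts the positions at which two equal-length strings differ; a minimal sketch:
def hamming(s1, s2):
    # only defined for strings of equal length
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("Apple", "Appel"))  # 2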
TheFuzz is a package that implements Levenshtein distance in Python, with some helper functions to help in certain situations where you may want two distinct strings to be considered identical. For example:
>>> from thefuzz import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
You can create a function like:
def similar(w1, w2):
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))
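For instance, with this padding-based definition:
>>> similar("Apple", "Appel")
0.6
>>> similar("Apple", "Mango")
0.0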
Note that difflib.SequenceMatcher only finds the longest contiguous matching subsequence; this is often not what is desired. For example:
>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012 # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)] # only the first block is recorded
Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this, including Biopython. This example uses the Needleman-Wunsch algorithm:
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'
Using biopython or another bioinformatics package is more flexible than any part of the python standard library since many different scoring schemes and algorithms are available. Also, you can actually get the matching sequences to visualise what is happening:
>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el
The distance package includes Levenshtein distance:
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity
Here are some examples:
- Normalized, metric, similarity and distance
- (Normalized) similarity and distance
- Metric distances
- Shingles (n-gram) based similarity and distance
- Levenshtein
- Normalized Levenshtein
- Weighted Levenshtein
- Damerau-Levenshtein
- Optimal String Alignment
- Jaro-Winkler
- Longest Common Subsequence
- Metric Longest Common Subsequence
- N-Gram
- Shingle (n-gram) based algorithms
- Q-Gram
- Cosine similarity
- Jaccard index
- Sorensen-Dice coefficient
- Overlap coefficient (i.e., Szymkiewicz-Simpson)
The built-in SequenceMatcher is very slow on large input; here's how it can be done with diff-match-patch:
from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)
    # similarity
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length
    return sim, diff
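A quick usage check of the helper above (the exact split depends on diff-match-patch, but with these inputs the common text should be 4 of 5 characters):
sim, diff = compute_similarity_and_diff("Apple", "Appel")
print(sim)  # expected to be around 0.8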
BLEUscore
BLEU, or the Bilingual Evaluation Understudy, is a score for comparing
a candidate translation of text to one or more reference translations.
A perfect match results in a score of 1.0, whereas a perfect mismatch
results in a score of 0.0.
Although developed for translation, it can be used to evaluate text
generated for a suite of natural language processing tasks.
Code:
import nltk
from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4
C1='Text'
C2='Best'
print('BLEUscore:',bleu([C1], C2, smoothing_function=smoothie))
Examples (by updating C1 and C2):
C1='Test' C2='Test'
BLEUscore: 1.0
C1='Test' C2='Best'
BLEUscore: 0.2326589746035907
C1='Test' C2='Text'
BLEUscore: 0.2866227639866161
You can also compare sentence similarity:
C1='It is tough.' C2='It is rough.'
BLEUscore: 0.7348889200874658
C1='It is tough.' C2='It is tough.'
BLEUscore: 1.0
Textdistance:
TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms. Features:
- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparison of more than two sequences
- Some algorithms have more than one implementation in one class
- Optional numpy usage for maximum speed
Example1:
import textdistance
textdistance.hamming('test', 'text')
Output:
1
Example2:
import textdistance
textdistance.hamming.normalized_similarity('test', 'text')
Output:
0.75
Thanks and Cheers!!!
There are many metrics to define similarity and distance between strings as mentioned above. I will give my 5 cents by showing an example of Jaccard similarity with Q-Grams and an example with edit distance.
The libraries
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance import edit_distance
Jaccard Similarity
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))
and we get:
0.33333333333333337
And for the Apple and Mango
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))
and we get:
0.0
Edit Distance
edit_distance('Apple', 'Appel')
and we get:
2
And finally,
edit_distance('Apple', 'Mango')
and we get:
5
Cosine Similarity on Q-Grams (q=2)
Another solution is to work with the textdistance library. I will provide an example of Cosine Similarity
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5
Adding the spaCy NLP library to the mix as well:
import spacy
import jellyfish
from difflib import SequenceMatcher

#profile
def main():
    str1 = "Mar 31 09:08:41 The world is beautiful"
    str2 = "Mar 31 19:08:42 Beautiful is the world"
    print("NLP Similarity=", nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity", SequenceMatcher(None, str1, str2).ratio())
    print("Jellyfish lib similarity", jellyfish.jaro_distance(str1, str2))

if __name__ == '__main__':
    # python3 -m spacy download en_core_web_sm
    # nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()
Run with Robert Kern's line_profiler
kernprof -l -v ./python/loganalysis/testspacy.py
NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562
However, the times are revealing:
Function: main at line 32
Line # Hits Time Per Hit % Time Line Contents
==============================================================
32 #profile
33 def main():
34 1 1.0 1.0 0.0 str1= "Mar 31 09:08:41 The world is beautiful"
35 1 0.0 0.0 0.0 str2= "Mar 31 19:08:42 Beautiful is the world"
36 1 43248.0 43248.0 99.1 print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
37 1 375.0 375.0 0.9 print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio())
38 1 30.0 30.0 0.1 print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))
Here's what I thought of:
import string

def match(a, b):
    a, b = a.lower(), b.lower()
    error = 0
    for i in string.ascii_lowercase:
        error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total - error) / total

if __name__ == "__main__":
    print(match("pple inc", "Apple Inc."))
Python 3.6+
- No library imported
- Works well in most scenarios
On Stack Overflow, when you try to add a tag or post a question, it brings up all the relevant stuff. This is so convenient and is exactly the algorithm that I was looking for. Therefore, I coded a query-set similarity filter.
def compare(qs, ip):
    al = 2
    v = 0
    for ii, letter in enumerate(ip):
        if letter == qs[ii]:
            v += al
        else:
            ac = 0
            for jj in range(al):
                if ii - jj < 0 or ii + jj > len(qs) - 1:
                    break
                elif letter == qs[ii - jj] or letter == qs[ii + jj]:
                    ac += jj
                    break
            v += ac
    return v

def getSimilarQuerySet(queryset, inp, length):
    return [k for tt, (k, v) in enumerate(reversed(sorted({it: compare(it, inp) for it in queryset}.items(), key=lambda item: item[1])))][:length]
if __name__ == "__main__":
print(compare('apple', 'mongo'))
# 0
print(compare('apple', 'apple'))
# 10
print(compare('apple', 'appel'))
# 7
print(compare('dude', 'ud'))
# 1
print(compare('dude', 'du'))
# 4
print(compare('dude', 'dud'))
# 6
print(compare('apple', 'mongo'))
# 2
print(compare('apple', 'appel'))
# 8
print(getSimilarQuerySet(
[
"java",
"jquery",
"javascript",
"jude",
"aja",
],
"ja",
2,
))
# ['javascript', 'java']
Explanation
compare takes two strings and returns a positive integer.
You can edit the al (allowed) variable in compare; it indicates how large a range we need to search through. It works like this: the two strings are iterated over, and if the same character is found at the same index, the accumulator is increased by the largest value. Otherwise, we search within the allowed index range and, if there is a match, add to the accumulator based on how far away the letter is (the further away, the smaller the addition).
length indicates how many items you want in the result, i.e. those most similar to the input string.
I have my own, for my purposes, which is 2x faster than difflib SequenceMatcher's quick_ratio() while providing similar results. a and b are strings:
score = 0
for letter in a:
    score += b.count(letter)

Calculate Krippendorff Alpha for Multi-label Annotation

How can I calculate Krippendorff Alpha for multi-label annotations?
In the case of multi-class annotation (assuming that 3 coders have annotated 4 texts with 3 labels: a, b, c), I first construct the reliability data matrix and then the coincidence matrix, and based on the coincidences I can calculate Alpha:
[Reliability data and coincidence matrix screencap]
The question is how I can prepare the coincidences and calculate Alpha in the case of a multi-label classification problem like the following:
A Python implementation or even Excel would be appreciated.
Came across your question looking for similar information. We used the below code, with nltk.agreement for the metrics and pandas_ods_reader to read the data from a LibreOffice spreadsheet. Our data has two annotators, and for some of the items there can be two labels (for instance, one coder annotated one label only and the other coder annotated two labels instead).
The spreadsheet screencap below shows the structure of the input data. The column for annotation items is called annotItems, and annotation columns are called coder1 and coder2. The separator when there's more than one label is a pipe, unlike the comma in your example.
The code is inspired by this SO post: Low alpha for NLTK agreement using MASI distance
[Spreadsheet screencap]
from nltk import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance
import pandas_ods_reader as pdreader
annotfile = "test-iaa-so.ods"
df = pdreader.read_ods(annotfile, "Sheet1")
annots = []
def create_annot(an):
    """
    Create frozensets with the unique label
    or with both labels splitting on pipe.
    Unique label has to go in a list so that
    frozenset does not split it into characters.
    """
    if "|" in str(an):
        an = frozenset(an.split("|"))
    else:
        # single label has to go in a list
        # need to cast or not depends on your data
        an = frozenset([str(int(an))])
    return an

for idx, row in df.iterrows():
    annot_id = row.annotItem + str.zfill(str(idx), 3)
    annot_coder1 = ['coder1', annot_id, create_annot(row.coder1)]
    annot_coder2 = ['coder2', annot_id, create_annot(row.coder2)]
    annots.append(annot_coder1)
    annots.append(annot_coder2)

# based on https://stackoverflow.com/questions/45741934/
jaccard_task = agreement.AnnotationTask(distance=jaccard_distance)
masi_task = agreement.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(annots)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
For the data in the screencap linked from this answer, this would print:
Statistics for dataset using <function jaccard_distance at 0x7fa1464b6050>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.1818181818181818
Kappa: 0.35714285714285715
Multi-Kappa: 0.35714285714285715
Alpha: 0.02941176470588236
Statistics for dataset using <function masi_distance at 0x7fa1464b60e0>
C: {'coder1', 'coder2'}
I: {'item3002', 'item1000', 'item6005', 'item5004', 'item2001', 'item4003'}
K: {frozenset({'1'}), frozenset({'0'}), frozenset({'0', '1'})}
Pi: 0.09181818181818181
Kappa: 0.2864285714285714
Multi-Kappa: 0.2864285714285714
Alpha: 0.017962466487935425

Biopython: adding a section in the middle of a sequence and keeping features aligned

I want to add a section of sequence in the middle of a previous sequence (in a gb file) and have all the features still point at the same segments of the old sequence.
For example:
previous sequence: ATAGCCATTGAATGTGTGTGTGTCCTAGAGGGCCTAAAA
feature: misc_feature complement(20..27)
/gene="Py_ori+A"
I add TTTTTT in position 10.
new sequence: ATAGCCATTGTTTTTTAAGTGTGTGTGTCCTAGAGGGCCTAAAA
feature: misc_feature complement(26..33)
/gene="Py_ori+A"
The indexes of the features changed because the feature must still refer to the segment TGTCCTA. I want to save the new sequence in a new gb file.
Is there any Biopython function or method that can add a segment of sequence in the middle of the old sequence and add the length of the added segment to the indexes of the features that come after it?
TL;DR
Call + on your sliced segments (e.g. a + b). As long as you didn't slice into a feature you should be OK.
The long version:
Biopython supports feature joining. It is done simply by calling a + b on the respective SeqRecord objects (the features are part of the SeqRecord object, not the Seq class).
There is a quirk to be aware of regarding slicing a sequence with features: if you happen to slice into a feature, that feature will not be present in the resulting SeqRecord.
I've tried to illustrate the behaviour in the following code.
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
# THIS IS OK
a = SeqRecord(
    Seq('ACGTA'),
    id='a',
    features=[
        SeqFeature(FeatureLocation(2, 4, 1), id='f1')
    ]
)
b = SeqRecord(
    Seq('ACGTA'),
    id='b',
    features=[
        SeqFeature(FeatureLocation(2, 4, 1), id='f2')
    ]
)
c = a + b
print('seq a')
print(a.seq)
print(a.features)
print('\nseq b')
print(b.seq)
print(b.features)
print("\n two distinct features joined in seq c")
print(c.seq)
print(c.features)
print("notice how the second feature has now indices (7,9), instead of 2,4\n")
# BEWARE
# slicing into the feature will remove the feature !
print("\nsliced feature removed")
d = a[:3]
print(d.seq)
print(d.features)
# Seq('ACG')
# []
# However slicing around the feature will preserve it
print("\nslicing out of the feature will preserve it")
e = c[1:6]
print(e.seq)
print(e.features)
OUTPUT
seq a
ACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f1')]
seq b
ACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f2')]
two distinct features joined in seq c
ACGTAACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f1'), SeqFeature(FeatureLocation(ExactPosition(7), ExactPosition(9), strand=1), id='f2')]
notice how the second feature has now indices (7,9), instead of 2,4
sliced feature removed
ACG
[]
slicing out of the feature will preserve it
CGTAA
[SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(3), strand=1), id='f1')]
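To address the original question directly, here is a minimal sketch of the insertion itself (an illustration under assumptions rather than part of the answer above: the GenBank record is stood in for by a hand-built SeqRecord, and the insertion point is taken to be 10):
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Stand-in record; in practice you would load it with SeqIO.read("old.gb", "genbank")
record = SeqRecord(Seq('ATAGCCATTGAATGTGTGTGTGTCCTAGAGGGCCTAAAA'), id='old')
insert = SeqRecord(Seq('TTTTTT'), id='insert')

pos = 10  # 0-based insertion point
new_record = record[:pos] + insert + record[pos:]

# Features lying entirely after pos are shifted by len(insert); features that
# the slice cuts through are dropped, as explained above. Recent Biopython
# versions also need new_record.annotations["molecule_type"] = "DNA" before
# SeqIO.write(new_record, "new.gb", "genbank").
print(len(record), len(new_record))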

Reindexing error within np.where and for loop

I have a large CSV table with ~1,200 objects. I am narrowing down those to a volume limited sample (VLS) of 326 objects by setting certain parameters (only certain distances, etc.)
Within this VLS I am using a for loop to count the number of specific types of objects. I don't want to just count the entire VLS at once though, instead it'll count in "sections" (think of drawing boxes on a scatter plot and counting what's in each box).
I'm pretty sure my issue comes from the way pandas reads in the columns of my CSV table, and that the "box" array I have can't talk to the columns that are dtype: object.
I don't expect someone to have a perfect fix for this, but even pointing me to some specific and relevant information on pandas would be helpful and appreciated. I try reading the documentation for pandas, but I don't understand much.
This is how I read in the CSV table and my columns in case it's relevant:
file = pd.read_csv(r'~/Downloads/CSV')
#more columns than this, but they're all defined like this in my code
blend = file["blend"]
dec = file["dec"]
When I define my VLS inside the definition of the section I'm looking at (named 'box') the code does work, and the for loop properly counts the objects.
This is what it looks like when it works:
color = np.array([-1,0,1])
for i in color:
    box1 = np.where((constant box parameters) & (variable par >= i) &
                    (variable par < i+1) & ('Volume-limited parameters I wont list'))[0]
    binaries = np.where(blend[box1].str[:1].eq('Y'))[0]
    candidates = np.where(blend[box1].str[0].eq('?'))[0]
    singles = np.where(blend[box1].str[0].eq('N'))[0]
    print("from", i, "to", i+1, "there are", len(binaries), "binaries,", len(candidates), "candidates,", len(singles), "singles.")
# Correct Output:
"from -1 to 0 there are 7 binaries, 1 candidates, 78 singles."
"from 0 to 1 there are 3 binaries, 1 candidates, 24 singles."
"from 1 to 2 there are 13 binaries, 6 candidates, 69 singles."
The problem is that I don't want to include the parameters for my VLS in the np.where() for "box". This is how I would like my code to look:
vollim = np.where((dec >= -30) & (dec <= 60) & (p_anglemas/err_p_anglemas >= 5) &
                  (dist <= 25) & (err_j_k_mag < 0.2))[0]
j_k_mag_vl = j_k_mag[vollim]
abs_jmag_vl = abs_jmag[vollim]
blend_vl = blend[vollim]
hires_vl = hires[vollim]
#%%
color = np.array([-1,0,1])
for i in color:
    box2 = np.where((abs_jmag_vl >= 13) & (abs_jmag_vl <= 16) &
                    (j_k_mag_vl >= i) & (j_k_mag_vl < i+1))[0]
    binaries = np.where(blend_vl[box2].str[:1].eq('Y'))[0]
    candidates = np.where(blend_vl[box2].str[0].eq('?'))[0]
    singles = np.where(blend_vl[box2].str[0].eq('N'))[0]
    print("from", i, "to", i+1, "there are", len(binaries), "binaries,", len(candidates), "candidates,", len(singles), "singles.")
#Wrong Output:
"from -1 to 0 there are 4 binaries, 1 candidates, 22 singles."
"from 0 to 1 there are 1 binaries, 0 candidates, 5 singles."
"from 1 to 2 there are 4 binaries, 0 candidates, 14 singles."
When I print blend_vl[box2], a lot of the elements of blend_vl have been changed from their regular strings to NaN, which I do not understand.
When I print box1 and box2 they are the same lengths but they are different indices.
I think blend_vl[box2] would work properly if I changed blend_vl into a flat array?
I know this is a lot of information at once, but I appreciate any input, even just some more info about how pandas and arrays work. TIA!!
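A hedged sketch of the indexing issue described above (not part of the original post): np.where() returns positions, but the filtered Series blend_vl keeps the row labels of the original CSV, so blend_vl[box2] does a label-based lookup and misaligned labels show up as NaN (or raise, depending on the pandas version). Selecting by position with .iloc, or resetting the index after the volume-limited cut, keeps labels and positions aligned:
import numpy as np
import pandas as pd

blend = pd.Series(['?', 'Y', 'N', 'Y', 'N'])   # stand-in for file["blend"]
vollim = np.where(blend != '?')[0]             # positions [1, 2, 3, 4]
blend_vl = blend[vollim]                       # labels 1..4, positions 0..3

box2 = np.where(blend_vl.str[0].eq('Y'))[0]    # positions [0, 2] within blend_vl
# blend_vl[box2] would look up *labels* 0 and 2; label 0 does not exist here,
# which is where the NaN (or a KeyError) comes from.
print(blend_vl.iloc[box2])                     # positional: the intended 'Y' rows
# Alternatively: blend_vl = blend[vollim].reset_index(drop=True)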

Parsing information out of sequencing data

I have a txt file that is a converted fasta file and just contains a particular region that I'm interested in analyzing. It looks like this:
CTGGCCGCGCTGACTCCTCTCGCT
CTCGCAGCACTGACTCCTCTTGCG
CTAGCCGCTCTGACTCCGCTAGCG
CTCGCTGCCCTCACACCTCTTGCA
CTCGCAGCACTGACTCCTCTTGCG
CTCGCAGCACTAACACCCCTAGCT
CTCGCTGCTCTGACTCCTCTCGCC
CTGGCCGCGCTGACTCCTCTCGCT
I am currently using Excel to perform some calculations on the nucleotide diversity at each position. Some of the files have around 200,000 reads, which makes the Excel files unwieldy. I figure there must be an easier way to do this using Python or R.
Basically I want to take the .txt file with the list of sequences and measure the nucleotide diversity at each position using the entropy term -p(log2(p)), summed over the nucleotides observed at that position. Does anyone know how this might be done in a way besides Excel?
Thanks so much in advance for any help.
If you can work from the fasta file, that might be better, as there are packages specifically designed to work with that format.
Here, I give a solution in R, using the packages seqinr and also dplyr (part of tidyverse) for manipulating data.
If this were your fasta file (based on your sequences):
>seq1
CTGGCCGCGCTGACTCCTCTCGCT
>seq2
CTCGCAGCACTGACTCCTCTTGCG
>seq3
CTAGCCGCTCTGACTCCGCTAGCG
>seq4
CTCGCTGCCCTCACACCTCTTGCA
>seq5
CTCGCAGCACTGACTCCTCTTGCG
>seq6
CTCGCAGCACTAACACCCCTAGCT
>seq7
CTCGCTGCTCTGACTCCTCTCGCC
>seq8
CTGGCCGCGCTGACTCCTCTCGCT
You can read it into R using the seqinr package:
# Load the packages
library(tidyverse) # I use this package for manipulating data.frames later on
library(seqinr)
# Read the fasta file - use the path relevant for you
seqs <- read.fasta("~/path/to/your/file/example_fasta.fa")
This returns a list object, which contains as many elements as there are sequences in your file.
For your particular question - calculating diversity metrics for each position - we can use two useful functions from the seqinr package:
- getFrag() to subset the sequences
- count() to calculate the frequency of each nucleotide
For example, if we wanted the nucleotide frequencies for the first position of our sequences, we could do:
# Get position 1
pos1 <- getFrag(seqs, begin = 1, end = 1)
# Calculate frequency of each nucleotide
count(pos1, wordsize = 1, freq = TRUE)
a c g t
0 1 0 0
Showing us that the first position only contains a "C".
Below is a way to programmatically "loop" through all positions and do the calculations we might be interested in:
# Obtain fragment lengths - assuming all sequences are the same length!
l <- length(seqs[[1]])
# Use the `lapply` function to estimate frequency for each position
p <- lapply(1:l, function(i, seqs){
  # Obtain the nucleotide for the current position
  pos_seq <- getFrag(seqs, i, i)
  # Get the frequency of each nucleotide
  pos_freq <- count(pos_seq, 1, freq = TRUE)
  # Convert to data.frame, rename variables more sensibly
  ## and add information about the nucleotide position
  pos_freq <- pos_freq %>%
    as.data.frame() %>%
    rename(nuc = Var1, freq = Freq) %>%
    mutate(pos = i)
}, seqs = seqs)
# The output of the above is a list.
## We now bind all tables to a single data.frame
## Remove nucleotides with zero frequency
## And estimate entropy and expected heterozygosity for each position
diversity <- p %>%
  bind_rows() %>%
  filter(freq > 0) %>%
  group_by(pos) %>%
  summarise(shannon_entropy = -sum(freq * log2(freq)),
            het = 1 - sum(freq^2),
            n_nuc = n())
The output of these calculations now looks like this:
head(diversity)
# A tibble: 6 x 4
    pos shannon_entropy     het n_nuc
  <int>           <dbl>   <dbl> <int>
1     1        0.000000 0.00000     1
2     2        0.000000 0.00000     1
3     3        1.298795 0.53125     3
4     4        0.000000 0.00000     1
5     5        0.000000 0.00000     1
6     6        1.561278 0.65625     3
And here is a more visual view of it (using ggplot2, also part of tidyverse package):
ggplot(diversity, aes(pos, shannon_entropy)) +
  geom_line() +
  geom_point(aes(colour = factor(n_nuc))) +
  labs(x = "Position (bp)", y = "Shannon Entropy",
       colour = "Number of\nnucleotides")
Update:
To apply this to several fasta files, here's one possibility (I did not test this code, but something like this should work):
# Find all the fasta files of interest
## use a pattern that matches the file extension of your files
fasta_files <- list.files("~/path/to/your/fasta/directory",
                          pattern = ".fa", full.names = TRUE)
# Use lapply to apply the code above to each file
my_diversities <- lapply(fasta_files, function(f){
  # Read the fasta file
  seqs <- read.fasta(f)
  # Obtain fragment lengths - assuming all sequences are the same length!
  l <- length(seqs[[1]])
  # .... ETC - Copy the code above until ....
  diversity <- p %>%
    bind_rows() %>%
    filter(freq > 0) %>%
    group_by(pos) %>%
    summarise(shannon_entropy = -sum(freq * log2(freq)),
              het = 1 - sum(freq^2),
              n_nuc = n())
})
# The output is a list of tables.
## You can then bind them together,
## ensuring the name of the file is added as a new column "file_name"
names(my_diversities) <- basename(fasta_files) # name the list elements
my_diversities <- bind_rows(my_diversities, .id = "file_name") # bind tables
This will give you a table of diversities for each file. You can then use ggplot2 to visualise it, similarly to what I did above, but perhaps using facets to separate the diversity from each file into different panels.
You can open and read your file:
plist = []
with open('test.txt', 'r') as infile:
    for i in infile:
        # make the calculation of 'p' for each line here
        plist.append(p)
And then use your plist to make the calculation of your entropy.
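For completeness, a minimal per-position sketch in Python (an assumption-laden illustration, not from the answer above: it assumes test.txt holds one equal-length sequence per line and computes -sum(p*log2(p)) over the nucleotides at each position):
import math
from collections import Counter

with open('test.txt') as infile:
    seqs = [line.strip().upper() for line in infile if line.strip()]

for pos in range(len(seqs[0])):
    counts = Counter(seq[pos] for seq in seqs)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    print(pos + 1, round(entropy, 3))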
