To give you a bit of context: I am trying to convert a SAM file to BAM with
samtools view -bT reference.fasta sequences.sam > sequences.bam
which exits with the following error
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file
and the offending line looks like this:
SRR808297.2571281 99 gi|309056|gb|L20934.1|MSQMTCG 747 80 101M = 790 142 TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT ##CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC#=?CACC>CCC NM:i:2 MD:Z:98A1A
My sequence is 98 characters long, but a probable bug when creating the SAM file reported 101 in the CIGAR. I can afford to lose a couple of reads, and I don't have access at the moment to the source code that produced the SAM files, so there is no opportunity to hunt down the bug and re-run the alignment. In other words, I need a pragmatic solution to move on (for now). Therefore, I devised a Python script that counts the length of my string of nucleotides, compares it with what is registered in the CIGAR, and saves the "sane" lines in a new file.
#!/usr/bin/python
import itertools
from cigar import Cigar

with open('myfile.sam', 'r') as f, open('sane.sam', 'w') as out:
    # Loop through the file and skip the first three (header) lines
    for line in itertools.islice(f, 3, None):
        fields = line.split("\t")
        cigarlength = len(Cigar(fields[5]))  # Length implied by the CIGAR string
        seqlength = len(fields[9])           # Actual length of the sequence
        if cigarlength == seqlength:
            out.write(line)  # Preserve the "sane" line in a new file
As you can see, to translate the CIGAR into an integer giving the length, I am using the cigar module. To be honest, I am a bit wary of its behavior; it seems to miscalculate the length in very obvious cases. Is there another module, or a more explicit strategy, to translate the CIGAR into the length of the sequence?
Sidenote: it is interesting, to say the least, that this problem has been widely reported, yet no pragmatic solution can be found on the internet. See the links below:
https://github.com/COMBINE-lab/RapMap/issues/9
http://seqanswers.com/forums/showthread.php?t=67253
http://seqanswers.com/forums/showthread.php?t=21120
https://groups.google.com/forum/#!msg/snap-user/FoDsGeNBDE0/nRFq-GhlAQAJ
The SAM spec offers us this table of CIGAR operations which indicates which ones "consume" the query or the reference, complete with explicit instructions on how to calculate sequence length from a CIGAR string:
Op  BAM  Description                                             Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)   yes             yes
I   1    insertion to the reference                               yes             no
D   2    deletion from the reference                              no              yes
N   3    skipped region from the reference                        no              yes
S   4    soft clipping (clipped sequences present in SEQ)         yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)     no              no
P   6    padding (silent deletion from padded reference)          no              no
=   7    sequence match                                           yes             yes
X   8    sequence mismatch                                        yes             yes
“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
...
Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
This lets us trivially calculate the length of a sequence from its CIGAR by adding up the lengths of all the "consumes query" ops in the CIGAR. This is exactly what happens in the cigar module (see https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93), so I don't know why the OP here reckoned that module's implementation was wrong.
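As a quick sanity check (assuming the cigar package from PyPI is installed), its len() really does return the summed length of the query-consuming operations:
>>> from cigar import Cigar
>>> len(Cigar("101M"))
101
>>> len(Cigar("3S5M2I4M"))
14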
If we extract out the relevant code from the (already very short) cigar module, we're left with something like this as a short Python implementation of the summing operation described in the quote above:
from itertools import groupby

def query_len(cigar_string):
    """
    Given a CIGAR string, return the number of bases consumed from the
    query sequence.
    """
    read_consuming_ops = ("M", "I", "S", "=", "X")
    result = 0
    cig_iter = groupby(cigar_string, lambda c: c.isdigit())
    for _, length_digits in cig_iter:
        length = int(''.join(length_digits))
        op = next(next(cig_iter)[1])
        if op in read_consuming_ops:
            result += length
    return result
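For example (the last call shows that a deletion does not consume the query):
>>> query_len("101M")
101
>>> query_len("3S5M2I4M")
14
>>> query_len("5M2D5M")
10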
I suspect the reason there isn't a tool to fix this problem is that there is no general solution, aside from performing the alignment again using software that does not exhibit this problem. In your example, the query sequence aligns perfectly to the reference, so the CIGAR string is not very interesting (just a single Match operation prefixed by the overall query length); the fix simply requires changing 101M to 98M.
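If every offending read looks like yours (a single all-M CIGAR whose length merely disagrees with SEQ), a one-off repair is easy to sketch (myfile.sam and fixed.sam are placeholder names; field positions are per the SAM spec):
with open('myfile.sam') as fin, open('fixed.sam', 'w') as fout:
    for line in fin:
        if line.startswith('@'):  # pass header lines through untouched
            fout.write(line)
            continue
        fields = line.rstrip('\n').split('\t')
        cig, seq = fields[5], fields[9]
        # Only touch the trivial case: a CIGAR that is a single M operation
        if cig.endswith('M') and cig[:-1].isdigit() and int(cig[:-1]) != len(seq):
            fields[5] = '%dM' % len(seq)  # e.g. rewrite 101M as 98M
        fout.write('\t'.join(fields) + '\n')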
However, for more complex CIGAR strings (e.g. those that include Insertions, Deletions, or any other operations), you would have no way of knowing which part of the CIGAR string is too long. If you subtract from the wrong part of the CIGAR string, you'll be left with a misaligned read, which is probably worse for your downstream analysis than just leaving the whole read out.
That said, if it happens to be trivial to get it right (perhaps your broken alignment procedure always adds extra bases to the first or last CIGAR operation), then what you need to know is the correct way to calculate the query length according to the CIGAR string, so that you know what to subtract from it.
samtools calculates this using the htslib function bam_cigar2qlen.
The other functions that bam_cigar2qlen calls are defined in sam.h, including a helpful comment showing the truth table for which operations consume query sequence vs reference sequence.
In short, to calculate the query length of a CIGAR string the way that samtools (really htslib) does it, you should add the given length for CIGAR operations M, I, S, =, or X and ignore the length of CIGAR operations for any of the other operations.
The current version of the python cigar module seems to be using the same set of operations, and its algorithm for calculating the query length (which is what len(Cigar(cigar)) would return) looks right to me. What makes you think that it isn't giving the correct results?
It looks like you should be able to use the cigar python module to hard clip from either the left or right end using the mask_left or mask_right method with mask="H".
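For instance, something along these lines (an untested sketch; the exact rendering of the result may differ between versions of the cigar module):
>>> from cigar import Cigar
>>> Cigar("101M").mask_left(3, mask="H").cigar
'3H98M'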
I am running an aggregation script which relies heavily on aggregating/grouping on an identifier column. Each identifier in this column is 32 characters long as a result of a hashing function.
So my ID column, which will be used in the pandas groupby, has something like
e667sad2345...1238a
as an entry.
I tried to add a prefix "ID" to some of the samples, for easier separation afterwards. Thus, I had some identifiers with 34 characters and others still with 32 characters.
e667sad2345...1238a
IDf7901ase323...1344b
Now the aggregation script takes 3 times as long (6000 vs. 2000 seconds), and the change in the ID column (adding the prefix) is the only thing that happened. Also note that I generate the data separately and save a pickle file, which my aggregation script reads as input, so the prefix addition is not part of the runtime I am talking about.
So now I am stunned why this particular change made such a huge impact. Can someone elaborate?
EDIT: I replaced the prefix with a suffix, so now it is
e667sad2345...1238a
f7901ase323...1344bID
and now it runs in 2000 seconds again. Does groupby use a binary search or something, so that all the IDs starting with the character 'I' are overrepresented?
OK, I had a revelation about what is going on.
My entries are sorted using quicksort, which has an expected runtime of O(n log n). In the worst case, however, quicksort actually runs in O(n^2). By making my entries imbalanced (20% of the data starts with "I", while the other 80% is randomly distributed over alphanumeric characters), I shifted the data towards a bad case for quicksort.
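A minimal sketch of how one might try to reproduce this (the sizes and the uuid-based IDs are hypothetical; whether, and how strongly, the slowdown shows up will depend on the pandas version and the data):
import time
import uuid
import pandas as pd

n = 1_000_000
random_ids = [uuid.uuid4().hex for _ in range(n)]   # 32-char hex IDs
prefixed_ids = ["ID" + s if i % 5 == 0 else s       # ~20% share the "ID" prefix
                for i, s in enumerate(random_ids)]

for name, ids in [("random", random_ids), ("prefixed", prefixed_ids)]:
    df = pd.DataFrame({"id": ids, "value": range(n)})
    start = time.perf_counter()
    df.groupby("id")["value"].sum()
    print(name, time.perf_counter() - start)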
I want to generate many randomized realizations of a low discrepancy sequence with scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld.random_base2(m=10)
twice, I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems like using Sobol.random() is discouraged by the docs.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
and then to generate, say, 1000 scramblings (or other randomizations) of this initial sequence. That would avoid regenerating the Sobol' sequence for each sample and only do the scrambling.
How can I do that? It seems to me that this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.
As the warning suggests, Sobol' is a sequence, meaning that there is a link with the previous samples. You have to respect the 2^m property. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is to reset the sequence between the draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k and skip the first 2^(k-1) points; then you can sample 2^n with n < k-1.
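As a concrete illustration of drawing many scrambled realizations, here is a minimal sketch that simply constructs a fresh engine per realization (it assumes a scipy version whose qmc.Sobol accepts a seed argument; passing the same Generator makes each engine scramble differently):
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(12345)
realizations = []
for _ in range(1000):
    engine = qmc.Sobol(d=2, scramble=True, seed=rng)  # fresh, independently scrambled engine
    realizations.append(engine.random_base2(m=10))    # 2**10 = 1024 points per realization
realizations = np.stack(realizations)                 # shape (1000, 1024, 2)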
I define a parameter t[i,s] as follows:
for i in Trucks:
    for s in Slots:
        t[i,s] = m.addVar(vtype=GRB.CONTINUOUS, name="t[%s,%s]" % (i, s))
I read the values of t[i,s] from an Excel file. Trucks is a list containing numbers from 0 to 263, and Slots is a list from 1 to 24. When I run the code, the following error occurs:
GurobiError: Name too long (maximum name length is 255 characters)
How can I fix that?
In case the items in Trucks and Slots are just strings, you can limit the length that is passed to the variable name like this:
maxlen = 250
for i in Trucks:
    for s in Slots:
        t[i,s] = m.addVar(vtype=GRB.CONTINUOUS, name="t[%s,%s]" % (i[:maxlen], s[:maxlen]))
Strings in Python can be treated just like any other array and support slicing.
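Note that slicing past the end of a string is safe, so the truncation above works even for identifiers shorter than maxlen:
>>> "abcdefgh"[:5]
'abcde'
>>> "abc"[:5]
'abc'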
It is unfortunate that there is no great way to deal with this, or that the limit exists at all. The gurobipy variable constructors work well when using tuplelists to define the sets/indices over which a variable exists.
But then the variable names, which are strings, become a function of the tuplelist, hit the 255-character limit, and there does not seem to be any good way to control this (unless you start manipulating your tuples, in which case you may be losing information you wanted to keep).
Why is there no way to adjust the 255-character limit if needed? If there is, I have not found it.
Given that there is a limit, why doesn't gurobipy just truncate the name at 255 characters? As far as I can tell, this would not really affect anything besides the way the variable appears when printed to a model file (e.g. in .lp format).
It is frustrating that the answer is just "unfortunately it works this way"; it can cause a model to fail when there is in fact no great reason for it to fail.
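If you want that truncation behaviour today, a tiny wrapper is enough of a workaround (add_var_safe is a hypothetical helper, not part of gurobipy):
def add_var_safe(model, maxlen=255, **kwargs):
    # Truncate any generated name to Gurobi's limit before creating the variable.
    name = kwargs.get("name", "")
    if len(name) > maxlen:
        kwargs["name"] = name[:maxlen]
    return model.addVar(**kwargs)

for i in Trucks:
    for s in Slots:
        t[i,s] = add_var_safe(m, vtype=GRB.CONTINUOUS, name="t[%s,%s]" % (i, s))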
Given two lists P and T, I'm trying to write an algorithm (a function minNumOfTransformations(P, T)) that returns the minimum number of moves or transformations required to get from P to T. These transformations are substitution, insertion, and deletion. E.g. getting from [0,1,2,3,5] to [0,2,4,5] requires at least 2 transformations: deleting the 1 and substituting the 3 with a 4. I'm attempting to do this through dynamic programming in Python.
def minNumOfTransformations(P, T):
    m, n = len(P), len(T)
    # If the first list is empty, the only option is to
    # insert all the elements of the second list
    if m == 0:
        return n
    # If the second list is empty, the only option is to
    # remove all the elements of the first list
    if n == 0:
        return m
    # The approach here is to solve simpler subproblems,
    # but this is where I get stuck
    if P[m-1] == T[n-1]:
        return minNumOfTransformations(P[:m-1], T[:n-1])
What's stopping you from finding a ready answer on Google is probably that you don't know that what you're looking for is known as the Levenshtein distance, and is a standard metric for the difference between sequences.
There exists a Python package built specifically to do this and implemented in C, so it'll likely be faster than whatever you write.
If you really want to do this yourself in Python, this'll help:
https://rosettacode.org/wiki/Levenshtein_distance#Python
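For reference, a minimal memoized sketch of the standard recurrence (levenshtein is my own illustrative name; it works on lists as well as strings):
from functools import lru_cache

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between the first i items of a
        # and the first j items of b.
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,         # deletion
                   d(i, j - 1) + 1,         # insertion
                   d(i - 1, j - 1) + cost)  # substitution
    return d(len(a), len(b))
which, for the example in the question, gives:
>>> levenshtein([0, 1, 2, 3, 5], [0, 2, 4, 5])
2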
I wrote the following Python code (using Python 2.7.6) to calculate the Fibonacci sequence. It doesn't use any extra libraries, just the core Python modules.
I was wondering if there was a limit to how many terms of the sequence I could calculate, perhaps due to the absurd length of the resulting integers, or whether there would be a point where Python no longer performed the calculations accurately.
Also, the fibopt(n) function seems to sometimes return the term under the one requested (e.g. 99th instead of 100th) but always works at lower terms (1st, 2nd, 10th, 15th). Why is that?
def fibopt(n):  # Returns term "n" of the Fibonacci sequence.
    f = [0, 1]  # List containing the first two numbers in the Fibonacci sequence.
    x = 0       # Empty integer to store the next value in the sequence. Not really necessary.
    optnum = 2  # Number of calculated entries in the sequence. Starts at 2 (0, 1).
    while optnum < n:  # Until the "n"th value in the sequence has been calculated.
        if optnum % 10000 == 0:
            # Notify the user for every 10000th value calculated. This is useful because
            # the program can take a very long time to calculate higher values
            # (e.g. about 15 minutes on an i7-4790 for the 10000000th value).
            print "Calculating index number %s." % optnum
        x = [f[-1] + f[-2]]  # Calculate the next value in the sequence from the previous two.
        f.extend(x)          # Append that value to the sequence. This could be f.extend([f[-1] + f[-2]]) instead.
        optnum += 1          # Increment the counter for number of values calculated by 1.
        del f[:-2]  # Remove all values from the list except the last two. Without this, the
                    # integers become so long that they fill 16 GB of RAM in seconds.
    return f[:n]  # Returns the requested term of the sequence.
def fib(n):
    # Similar to fibopt(n), but returns all of the terms in the sequence up to
    # and including term "n". Can use a lot of memory very quickly.
    f = [0, 1]
    x = 0
    while len(f) < n:
        x = [f[-1] + f[-2]]
        f.extend(x)
    return f[:n]
The good news is: integer math in Python is easy -- there are no overflows.
As long as your integers can fit within a C long, Python will use that. Once you go past that, it will auto-promote to arbitrary-precision integers (which means it'll be slower and use more memory, but the calculations will remain correct).
The only limits are:
The amount of memory addressable by the Python process. If you're using 32-bit Python, you need to be able to fit all of your data within 2 gigabytes of RAM (go past that and your program will fail with a MemoryError). If you're using 64-bit Python, your physical RAM + swapfile is the theoretical limit.
The time you're willing to wait while calculations are being performed. The larger your ints, the slower the calculations are. If you ever hit your swap space, your program will reach continental drift levels of slow.
If you go to the Python 2.7 documentation, there is a section with a Fibonacci example. In that section, the example stops at an arbitrary end rather than showing the elongated answer we would all want to view; it shortens it.
If this does not answer your question, please see section 4.6, Defining Functions.
If you have downloaded the interpreter, the manuals come preinstalled. You can go online to www.python.org if necessary, or you can view your manual to see the Fibonacci example that ends at an "arbitrary" point, i.e. without the entire numerical value.
Seth
P.S. If you have any questions about where to find this section in your manual, see The Python Tutorial / 4. More Control Flow Tools / 4.6 Defining Functions. I hope this helps a bit.
Python integers can express arbitrary-length values and will not be automatically converted to float. You can check by simply creating a very large number and checking its type:
>>> type(2**(2**25))
<class 'int'> # long in Python 2.x
fibopt returns f[:n], and that is a list; because of the del f[:-2], the list only ever contains the last two terms computed, so its first element is the term just below the one you asked for. You seem to expect the function to return a single term, so either the expectation (the first comment) or the implementation must change.
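For what it's worth, a minimal sketch of a fix (fibopt_fixed is a hypothetical replacement that returns the single n-th term, 1-indexed, while still keeping only two values in memory):
def fibopt_fixed(n):
    a, b = 0, 1              # the first two terms of the sequence
    for _ in range(n - 1):
        a, b = b, a + b      # advance the pair one term
    return a                 # the n-th term, e.g. fibopt_fixed(10) == 34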