BioPython AlignIO sequences must be the same length [multiple files] - python

I run into an error when I try to align sequences from multiple files.
Here is my script:
from Bio import AlignIO
from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalOmegaCommandline
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def divergence(fic1dna, fic2dna, fic1prot, fic2prot):
    seq1dna = list(SeqIO.parse(fic1dna, "fasta", alphabet=IUPAC.IUPACUnambiguousDNA()))
    seq2dna = list(SeqIO.parse(fic2dna, "fasta", alphabet=IUPAC.IUPACUnambiguousDNA()))
    seq1prot = list(SeqIO.parse(fic1prot, "fasta", alphabet=IUPAC.protein))
    seq2prot = list(SeqIO.parse(fic2prot, "fasta", alphabet=IUPAC.protein))
    u = 0
    while u < len(seq1dna):  # align the u-th record of one file with the u-th record of its paired file
        nuc1 = str(seq1dna[u].seq)
        nuc2 = str(seq2dna[u].seq)
        prot1 = str(seq1prot[u].seq)
        prot2 = str(seq2prot[u].seq)
        prot1 = SeqRecord(Seq(prot1, alphabet=IUPAC.protein), id='pro1')
        prot2 = SeqRecord(Seq(prot2, alphabet=IUPAC.protein), id='pro2')
        aln = MultipleSeqAlignment([prot1, prot2])
        print(aln)
        u += 1

print(divergence("concatenate_0035_fna_renamed.fst","concatenate_0042_fna_renamed.fst","concatenate_0035_faa_renamed.fst","concatenate_0042_faa_renamed.fst"))
As you can see, I have 4 files, corresponding to 244 sequences from 2 species, and I need to calculate dN and dS for each of them, so I need to align each sequence pair as a codon alignment.
But when I try to align my 244 protein sequences, the error ValueError("Sequences must all be the same length") is raised.
I do not know why the script does not accept sequences of different lengths, since all other programs do.
A short input would be:
one file with the AA sequences from species 1:
>EOG090X005Q
CEHNTAGRDCEKCLDFYNDAPWGRASPTNVHECKACNCNGFSNKCYFDKDLYERTGHGGHCIDCEENRDGANCERCKENFYQGMEDICLPCNCNPTGSRSLQCNAEGKCQCKPGVTGDKCDVCAPNYFEFTMHGCKPCDCNVSGSYGNTPQCDPQTGVCLCKQNVEGRRCRECKPGFFNLDVENEFGCTPCFCFGHSSQCSSAPKYQAHEISAHYIRDAEKWGAEDDQRKPVQLQFNANTQNIAVASKGSEILYFLASGQFLGDQRPSYNHDLKFTLRLGESGGYPSSQDIILEGARSSVSMNIYAQNNPEPSDVAQEYSFRLHEDPRYGWTPTLSNFEFMSILQNLTAIKIRGTYNKGGVGYLINFKLETAKIGREKGSAPANWVEKCSCPKAYVGDYCEECAPGYKHEPANGGPYSTCIPCDCNGHAHICDTATGFCICKHNTTGSNCELCAKGFYGNAIAGTADDCKPCPCPKDSGCIQLMDQSIVCTDCPVGYAGPRCEVCADAHFGDPTGQFGAPQECEECQCNGNVDPNAVGNCNRTTGECLKCIYNTAGEHCDKCLSGYFGDALDQKKKGDCKPCQCLEAGTVESPEGARKAPLCDGLTGFCSCRPHVIGRNCDKCEVDLNCIAVLKT
>EOG090X00BV
MNAHFPQNEIARSEAYNIMSVRKQYLVPKDGTPLSGLIQDHVISGVKMSIRGAFFTKADYQQLVFQALSNHKGEIKLLPPTILKPIMLWSGKQILSTIIINSIPKGKPYLSLTGKAKISSKAWQKEPARTWNAGGTPFTNPNSMSEAEVIIRKGELLCGVLDKTHYGATPYGLVHCMYELYGGDSSSALLSSFSKVFTFYLQWIGFTLGVKDILVVEEADKQRDNFINLVRKVGKVAAAKATELPVDVDELKLKETISEMLIKDPKFRANLDRQYKSLLDSYTNNINTVCLSEGLLEKFPYNNLQLMVQSGAKGSTVNTMQISCLLGQIELEGKRPPLMISGRSLPSFPPYDISPRAGGFIDGRFMTGIQPQEFFFHCMAGREGLIDTAVKTSRSGYLQRCLIKHLEGLSVAYDHTVRDSDSSVIQFAYGEDGLDVIKCQYFNKDQFEFLDVNSNAVISKSAIKKLKEDDKSKALAKSQKSLKKWKKKNGNPFEKVRYSPFTEFSAIAKNDIVLDDKPTDQTRDPNYWELEKMWRNLDADEKKQYARKRCPDPIPSKYSPEYKFGVINEQLNELTQNYLKNRKEHMYSDYTDKDKFTEIINAKYLASMAAPGEPVGLLAAQSIGEPSTQMTLNTFHFAGRGDMNVTLGIPRLREILMTASAKLKTPSMDIPFRSDLPDLNKKAERLRQKMNRVTVSDVLEKIDVHCEIVTNPNRQLKTVMRFSFLPHSQYKVQYTVKPAQIIKHMQNKFFSEMFSIIRKQAKTTCGVMWSTEKEKKRRAASDEDDEDGEGASPDVAEKAVNMDEDSSDEEGPNDDDDNTDVS
and the other for species 2:
>EOG090X005Q
MGGKIAAILLFAFFTSGSRSEPDFVDGQFNKINKNRVEVKCYDDFGAPQRCIPPFENAAFGVLMEATNTCGQDGRPTEFCRQTGVQRKPCEFCHPGDHPASFLTDRDNNDNATWWQSETMHEGIEYPNKVVLTLNLGKTYDITYVRVLFESPRPESWGIFRRRTEDSPWEPYQFYSATCRDTYGLPDRKDTVRGEDTRVLCTSEYSDISPLRRGTVAFSTLEGRPSAFQFDTNPALQSWVQATDLRLSLDRPNTFGDELFGDGQVLKSYYYAIADVAVGARCACNGHAGECINSPHTNGTTRRVCRCEHNTAGPDCNECLPFYNDAPWGRATTTDAHECKPCNCNGYSDRCYFDKDLYERSGHGGHCTDCRANRAGPNCERCRENFYQRLEDSYCVACNCNEIGSRSLQCNSEGKCQCKPGITGDKCDRCAANFFNFDSLGCTSCECSPKGSLDNEPNCDPVSGACVCKENVEGKRCRECRPGFFNLDLDNEFGCTPCFCYGHSSVCNLANGYSKLTIESMFGRGNEKWTASVAGNPIPLHYDAVTQTISVNAPDRDNVYFVAPERFLGDQRASYNQDLTFTLRIAENEPAPTARDVILEGGNGEQLTQPIFGQTNQLPNASPQVYKFRLNEHADYGWEPRVTSRAFMSVLSNLTAIKIRGTYTHQGRGFLDDVSLETAQRGAAGEPADWIEHCQCPHGYVGQFCESCAPGFHHDPPNGGPFSLCVPCNCNGHADICEAETGQCICHHNTAGSNCDLCSRGFYGYPLKGTPHDCKPCPCPDNGPCILLGNNPDPICSECPSGRTGARCETCSDGYFGNPDQGQACRLCDCNNNIDLNAVRNCNHETGECLKCVNNTAGFHCEDCLSGYWGDALSERKEDSCKLCQCYPPGTIELDDGSVAPCNQLTGHCACKPHVIGRNCDKCEDGYYQILSGDGCTACNCDPEGSYNRTCDATTGQCECRPGITGKRCDTCLPYQFGFGRDGCKHCDCDTIGSQELQCDASGQCPCLTNVEGRRCDRCKENKYNRQYGCIDCPPCYNLIQDSVNQHRRRLNELESTLRKINNSPTVMKDSDFEKELKNVENRVKSLLQVAKQGSGNENKTLVEQLDELRDQLNQIEKISQSVDATAEDARRTTNEGLTSIEEAERVLDQIYEQLTEAEDYLATDGARALAAAKKRADQVGQQNQQMTIIAQEARVLADLNTNEAKKIHVLAEQARNTSLEAYNLAKKAIAKYSNISDEIRGLENKLELLEDRFNEVKNLTAAAVAKSAAVDKEALQLLILDLRVPAVDTNELRILLETVSVDGSEIKEQAQLLLGQNEAWLNELANKARKSEELLERAQDQQAATADLLSEVDGANEKAKDALKRGNQTLVEAQETLKKLGEFDAEVQKERIKAQEALTVLEEIKDMVNEAIAKANETESVLKDAESNAIAAKDIAIQAQVSNNADEASANANLIRQEANKTKLDAVRLGNEADKLHLRVEITNSIAKKHEARVDKDVNATNEVNHQVGQARNSLNLAGQQVDKALAEVDEIIKELDVLPEIDDADLDRLEERLLAAEKEIEEANLEKRIRELTEAKNLQTQWVKNYEDEVSRLRLEVENIDDIRKALPSICYKRLRLEP
>EOG090X00BV
MFSIFTASDVRNLSVLKISTPLSFNILGHPLKGGLYDPALGPLNDRSDPCGTCGEGTIQCMGHFGHIELPVPVVNPLFHKVLTSLLKLSCLKCYTLQIPSYLKLLLNGKLRLMEEGFSNDIPGLEQEVGSAVAGMNRIAEGELEFISDIIEAYIEMTCNQRHHVQSGKSKESTSTRTLNMEWHHYIESVVKTCKASKLCINCRNPIPKMTILKNKILTNHVVNNEDTMMEDRVIHKLETSFMTPDQSKKHLRGLWQKEADILRIIIPCLGSVDLEFPTDVFFFEIIPVLPPITRPVNMLDNQLVEHPQSQVYKSIIQDCLVLRNIIQTIQDGDTTQLPEEGRAVFDEIRGDNAAEKLHHAWTTLQSNVDHLMDREMSKTTESANCHGLKQVIEKKEGIIRMHMMGKRVNYAARSVITPDPNLNIDEIGVPEAFALKLTYPVPVTPWNVTELRKLIINGPEIHPGAVMIEGEDGFVKLLRGDDKTQLEAIAKRLLTSSRKPFSGIKIVHRHLQNGDMLLLNRQPTLHKPSIMAHKARILKGEKTLRLHYANCKAYNADFDGDEMNAHFPQNELARSEGYFIANVSNQYLVPKDGTPLGGLIQDHVISGVRLTLRGNFFNRQDYMQLVYSAIADTTGDLILLPPTILKPVRLWSGKQIISTVIINLTPRGRAPINLKASAKISVKDWQVKKARKWKCGQEFTDQRTMSEAEVVIRGGELLSGVLDKTHYGATPYGLIHCLFELYGGTCSSKVLSAFGKLFQTYLQISGFTLGVEDILVVRKSDQKRREIIEACRQIGDQIQTATVELPPGTSEEQVKSKMEESYAKDPKFRAIVDRKYKSALDVFTNNINKTCLPAGLLKKFPHNNLQLMVQSGAKGSTVNTMQISCLLGQIELEGKRPPLMINGKSLPSFPAYDSSPRSGGFIDGRFMTGIQPQEFFFHCMAGREGLIDTAVKTSRSGYLQRCLIKHLEGLTVNYDSTVRDSDGSLIQMSYGEDGLDIPNSRFLRKEELDFLVENRKAIVDPALVEHLKDETTEKIRKINKKIRKWRTKHGNGSTKWRNSEFAKFSEINRNSGSSKNRQINSNCGRTKAALSLMKKWIRADEEVKKKLKDECVRCPDPVTSIFRQDLQFGVLTEKMEALMEEYLDEKSRRFTTSIGKEEVRDLLCTKIMKSLCPPGEPVGLLAAQSIGEPSTQMTLNTFHFAGRGEMNVTLGIPRLREILMMASKNIKTPSMEIPFRTDLPNVENQATKLQLKLTKCYLSNILKNIKLDRKLEENPNRQLTFTLTVNCLPHKFYKNEYCVKPHNVLNEIERNFFKLFFRAIKKIGKATGTLLHIEEEKSSSREDDAMLDTGEPDETEAKPNRSDLGELHESSDEDEAAEDADATASRSIARHRENQEYEDPEEEEIEDAAPREPEDEENPQNPTNLPPEDEDDLDQPMCVADELITEQRKKDVVNMHPYALDYDYDSEKFLWCKLTFWLPLRMCRLDLPTILRTVAEKVVLWETPAIKRAFTFQNSEGETILKTDGLNIVEMFKYAQILDLHKLYTNDIYGVSRTYGIEAANRVILKEVKDVFKMYGITVDSRHLSLIADYMTFDGTFQPLSRKGMEDSASPLQQMSFEASLNFLKNATLQGKHDDLMSPSSRLMVGQPCKTGTGAFNVLFKMNNTAVSM
Could someone help me?
Thank you

Related

Create an ISO9660 compliant filename using pycdlib

I'm trying to implement the pycdlib example-creating-new-basic-iso example shown below. About halfway down there is a line that reads iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.;1'). This writes a new file to the ISO that will be named "FOO" in the root directory of the ISO. This example works for me.
Building on the example, I'm trying to change the filename inside the iso from "/FOO", to "/FOO.txt" but I keep getting the error, PyCdlibInvalidInput: ISO9660 filenames must consist of characters A-Z, 0-9, and _. How do I write an ISO9660 compliant filename with pycdlib with ".txt" in it?
Example code:
try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO
import pycdlib
iso = pycdlib.PyCdlib()
iso.new()
foostr = b'foo\n'
iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.;1')
iso.add_directory('/DIR1')
iso.write('new.iso')
iso.close()
The key here is in the error: PyCdlibInvalidInput: ISO9660 filenames must consist of characters A-Z, 0-9, and _, but there is a more complete explanation (https://wiki.osdev.org/ISO_9660#Filenames):
d-characters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _
Filenames must use d-character encoding (strD), plus dot and semicolon which have to occur exactly once per filename. Filenames are composed of a File Name, a dot, a File Name Extension, a semicolon; and a version number in decimal digits. The latter two are usually not displayed to the user.
There are three Levels of Interchange defined. Level 1 allows filenames with a File Name length of 8 and an extension length of 3 (like MS-DOS). Levels 2 and 3 allow File Name and File Name Extension to have a combined length of up to 30 characters.
The ECMA-119 Directory Record format can hold composed names of up to 222 characters. This would violate the specs but must nevertheless be handled by a reader of the filesystem.
You can't name the file FOO.txt because lowercase letters aren't included in the d-characters. You need to capitalize the extension in order to be ISO9660-compliant.
iso.add_fp(BytesIO(foostr), len(foostr), '/FOO.TXT;1')
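The naming rules quoted above can be checked before calling pycdlib. The following is only a sketch: to_iso9660_name is a hypothetical helper (it is not part of pycdlib), and it assumes the Level 1 limits of 8 characters for the name and 3 for the extension.

```python
import re

# d-characters allowed by ISO9660: A-Z, 0-9 and underscore.
D_CHARS = re.compile(r"[A-Z0-9_]*")


def to_iso9660_name(filename, version=1):
    """Hypothetical helper: turn 'foo.txt' into 'FOO.TXT;1' (Level 1 rules)."""
    # Upper-case first, because lowercase letters are not d-characters.
    name, _, ext = filename.upper().partition(".")
    if not D_CHARS.fullmatch(name) or not D_CHARS.fullmatch(ext):
        raise ValueError("only A-Z, 0-9 and _ are allowed: %r" % filename)
    if len(name) > 8 or len(ext) > 3:
        raise ValueError("Level 1 allows at most 8.3 characters: %r" % filename)
    # The dot and the semicolon must each occur exactly once.
    return "{}.{};{}".format(name, ext, version)


print(to_iso9660_name("foo.txt"))  # FOO.TXT;1
print(to_iso9660_name("FOO"))      # FOO.;1
```

The result could then be passed on as '/' + to_iso9660_name('foo.txt') in the add_fp call from the question.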

python alternative for awk?

I have two fasta files, and I want to search for sequence IDs and assign only the sequence corresponding to the ID to a string in Python.
I currently have:
import os
#use awk on the command line to search reference file and cut the reference sequence
os.system("awk '/LOC_OS05G45410.1/{getline;print}' Ref_seqs.fasta > sangerRef")
#use awk on the command line to cut the aligned sequence
os.system("awk '/seq1/{getline;print}' Sanger_seq_1.fasta > sangerAlign")
Ref_seq = open('sangerRef', 'r').read()
Sanger_seq = open('sangerAlign', 'r').read()
When I print these variables, everything looks fine:
TGGTGAGGCTTTTGACAGGGTTGAGCTGAGCCTGGTCTCCCTGGAGAAACTCTTCCAGAGAGCAAATGATGCTTGCACAGCTGCTGAAGAAATGTACTCCCATGGTCATGGTGGTACTGAACCCAG
CTGCTGCCCAAGTACTTCAAGCACAACAACTTCTCCAGCTTCATCAGGCAGCTCAACGCCTACGGTTTCCGAAAAATCGATCCTGAGAGATGGGAGTTCGCAAACGAGGATTTCATAAGAGGGCACACGCACCTT
However, when I try to read these variables into another function, it doesn't work:
from Bio import pairwise2
from Bio.Align import substitution_matrices
#load sequences
s1=Ref_seq
s2=Sanger_seq
matrix = substitution_matrices.load("NUC.4.4")
gap_open = -10
gap_extend = -0.5
align = pairwise2.align.globalds(s1, s2, matrix, gap_open, gap_extend)
align
I'm thinking it might be better to replace the awk command with a Python command?
I think it's because you haven't parsed the sequences. I don't know if I am using the word 'Parse' right, though.
I think this should work
from Bio import SeqIO, pairwise2
from Bio.Align import substitution_matrices
s1 = SeqIO.read('filepath/filename.fasta','fasta')
s2 = SeqIO.read('filepath/file.fasta','fasta')
matrix = substitution_matrices.load("NUC.4.4")
gap_open = -10
gap_extend = -0.5
align = pairwise2.align.globalds(s1.seq, s2.seq, matrix, gap_open, gap_extend)
align
The immediate problem is that read() returns all the lines with a newline at the end of each.
But indeed, your Awk commands should be trivial to replace with native Python.
def getseq(filename, search):
    with open(filename) as reffile:
        for line in reffile:
            if search in line:
                # the sequence is assumed to be on the line right after the header
                return next(reffile).rstrip('\n')
s1 = getseq("Ref_seqs.fasta", "LOC_OS05G45410.1")
s2 = getseq("Sanger_seq_1.fasta", "seq1")
Probably BioPython already contains a better function for doing this. In particular, your Awk script (and hence this blind reimplementation) assumes that each sequence only occupies one line in the file.
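If a sequence may span several lines, the lookup has to keep reading until the next header. Here is a minimal pure-Python sketch of that idea; get_fasta_sequence is a hypothetical name, and it assumes header lines start with ">" as in standard FASTA.

```python
def get_fasta_sequence(filename, search):
    """Return the whole sequence of the first record whose header contains
    `search`, or None if no header matches."""
    chunks = []
    found = False
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if found:
                    break  # the next record starts here, so we are done
                found = search in line
            elif found:
                chunks.append(line)  # collect every line of the matching record
    return "".join(chunks) if found else None


# Usage with the file names from the question:
# s1 = get_fasta_sequence("Ref_seqs.fasta", "LOC_OS05G45410.1")
# s2 = get_fasta_sequence("Sanger_seq_1.fasta", "seq1")
```

Because every line is stripped before it is joined, the returned string contains no newline characters, which is what the alignment call needs.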

Extract a string between other two in Python

I am trying to extract the comments from an FDF file (a PDF comment file). In practice, this means extracting a string that sits between two other strings. I did the following:
I open the fdf file with the following command:
import re
import os
os.chdir("currentworkingdirectory")
archcom =open("comentarios.fdf", "r")
cadena = archcom.read()
With the opened file, I create a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena) but I can't see where. I have tried many combinations with no success.
I appreciate any comment.
Regards
It seems to me that there are 2 problems:
a) You are looking for nendobj, but the n is actually part of the line break \n. For the same reason you will not get a leading n in the output, because there is no n.
b) Since the text you're looking for spans several newlines, you need the re.DOTALL flag, which makes . match newline characters as well.
Final code:
a = re.findall(r"endobj(.*?)W 3\.0",cadena, re.DOTALL)
Also note, that there will be a second result, confirmed by Regex101.
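To make the effect of the flag concrete, here is a small sketch that runs the same pattern on the string from the question with and without re.DOTALL. The variable names without_flag and with_flag are only illustrative.

```python
import re

cadena = (
    "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n"
    "216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n"
    "217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n"
    "218 0 obj\n<</W 3.0>>\nendobj\n"
    "219 0 obj\n<</W 3.0>>\nendobj\n"
    "trailer\n<</Root 1 0 R>>\n%%EOF\n"
)

pattern = r"endobj(.*?)W 3\.0"

# Without the flag "." stops at every newline, so nothing matches.
without_flag = re.findall(pattern, cadena)

# With re.DOTALL "." also matches "\n", so the match can span lines.
with_flag = re.findall(pattern, cadena, re.DOTALL)

print(len(without_flag))  # 0
print(len(with_flag))     # 2
for match in with_flag:
    print(repr(match))
```

Each captured string ends in "<</" rather than "<<", because the slash in "/W 3.0" sits before the text the pattern stops at.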

renaming pcraster mapstack

I have a folder containing 20 years of daily precipitation maps as a PCRaster map stack. I managed to extract the precipitation values for my area of interest from the original NetCDF file, and I named the files like this to avoid confusion:
precip.19810101
precip.19810102
precip.19810103
precip.19810104
precip.19810105
...
precip.20111231
Now I want to rename all of these files to PCRaster map stack names that follow the same sequence of dates:
precip00.001
precip00.002
precip00.003
precip00.004
...
I'm a beginner in Python. Is there any help or an example that would let me figure out how to do this?
Thank you
Here's something I put together, based on some old Python scripts I once wrote:
#! /usr/bin/env python
# Rename PCRaster map stack with names following prefix.yyyymmdd to stack with valid
# PCRaster time step numbers
# Johan van der Knijff
#
# Example input stack:
#
# precip.19810101
# precip.19810102
# precip.19810103
# precip.19810104
# precip.19810105
#
# Then run script with following arguments:
#
# python renpcrstack.py precip 1
#
# Result:
#
# precip00.001
# precip00.002
# precip00.003
# precip00.004
# precip00.005
#
import sys
import os
import argparse
import math
import datetime
import glob
# Create argument parser
parser = argparse.ArgumentParser(
description="Rename map stack")
def parseCommandLine():
    # Add arguments
    parser.add_argument('prefix',
                        action="store",
                        type=str,
                        help="prefix of input map stack (also used as output prefix)")
    parser.add_argument('stepStartOut',
                        action="store",
                        type=int,
                        help="time step number that is assigned to first map in output stack")
    # Parse arguments
    args = parser.parse_args()
    return args
def dateToJulianDay(date):
    # Calculate Julian Day from date
    # Source: https://en.wikipedia.org/wiki/Julian_day#Converting_Julian_or_Gregorian_calendar_date_to_Julian_day_number
    # The formula relies on integer division, so use // (under Python 3, / yields floats)
    a = (14 - date.month) // 12
    y = date.year + 4800 - a
    m = date.month + 12*a - 3
    JulianDay = date.day + (153*m + 2) // 5 + 365*y + y // 4 \
        - y // 100 + y // 400 - 32045
    return JulianDay
def genStackNames(prefix, start, end, stepSize):
    # Generate list with names of all maps
    # map name is made up of 11 characters, and chars 8 and 9 are
    # separated by a dot. Name starts with prefix, ends with time step
    # number and all character positions in between are filled with zeroes
    # define list that will contain map names
    listMaps = []
    # Count no chars prefix
    charsPrefix = len(prefix)
    # Maximum no chars needed for suffix (end step)
    maxCharsSuffix = len(str(end))
    # No of free positions between pre- and suffix
    noFreePositions = 11 - charsPrefix - maxCharsSuffix
    # Trim prefix if not enough character positions are available
    if noFreePositions < 0:
        # No of chars to cut from prefix if 11-char limit is exceeded
        charsToCut = charsPrefix + maxCharsSuffix - 11
        charsToKeep = charsPrefix - charsToCut
        # Updated prefix
        prefix = prefix[0:charsToKeep]
        # Updated prefix length
        charsPrefix = len(prefix)
    # Generate name for each step
    for i in range(start, end + 1, stepSize):
        # No of chars in suffix for this step
        charsSuffix = len(str(i))
        # No of zeroes to fill
        noZeroes = 11 - charsPrefix - charsSuffix
        # Total no of chars right of prefix
        charsAfterPrefix = noZeroes + charsSuffix
        # Name of map
        thisName = prefix + (str(i)).zfill(charsAfterPrefix)
        thisFile = thisName[0:8] + "." + thisName[8:11]
        listMaps.append(thisFile)
    return listMaps
def main():
    # Parse command line arguments
    args = parseCommandLine()
    prefix = args.prefix
    stepStartOut = args.stepStartOut
    # Glob pattern for input maps: prefix + dot + 8 char extension
    pattern = prefix + ".????????"
    # Get list of all input maps based on glob pattern
    mapsIn = glob.glob(pattern)
    # Set time format
    tfmt = "%Y%m%d"
    # Set up dictionary that will act as lookup table between Julian Days (key)
    # and Date string
    jDayDate = {}
    for mapIn in mapsIn:
        baseNameIn = os.path.splitext(mapIn)[0]
        dateIn = os.path.splitext(mapIn)[1].strip(".")
        # Convert to date / time format
        dt = datetime.datetime.strptime(dateIn, tfmt)
        # Convert date to Julian day number
        jDay = int(dateToJulianDay(dt))
        # Store as key-value pair in dictionary
        jDayDate[jDay] = dateIn
    # Number of input maps (equals number of key-value pairs)
    noMaps = len(jDayDate)
    # Create list of names for output files
    mapNamesOut = genStackNames(prefix, stepStartOut, noMaps + stepStartOut - 1, 1)
    # Iterate over Julian Days (ascending order)
    i = 0
    for key in sorted(jDayDate):
        # Name of input file
        fileIn = prefix + "." + jDayDate[key]
        # Name of output file
        fileOut = mapNamesOut[i]
        # Rename file
        os.rename(fileIn, fileOut)
        print("Renamed " + fileIn + " ---> " + fileOut)
        i += 1

main()
(Alternatively download the code from my Github Gist.)
You can run it from the command line, using the prefix of your map stack and the number of the first output map as arguments, e.g.:
python renpcrstack.py precip 1
Please note that the script renames the files in place, so make sure to make a copy of your original map stack in case something goes wrong (I only did some very limited testing on this!).
Also, the script assumes a non-sparse input map stack, i.e. in case of daily maps, an input map exists for each day. In case of missing days, the numbering of the output maps will not be what you'd expect.
The internal conversion of all dates to Julian Days may be a bit overkill here, but once you start doing more advanced transformations it does make things easier because it gives you decimal numbers which are more straightforward to manipulate than date strings.
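For a stack with missing days, one possible variation is to derive the time step from the date itself instead of from the position in the sorted list, so that a missing day leaves a gap in the numbering. This is only a sketch, not the script above: stack_name and step_for_date are hypothetical helpers, and the day count comes from the standard library's date subtraction rather than the hand-written Julian Day formula.

```python
import datetime


def stack_name(prefix, step):
    """Build an 8.3 map stack name, e.g. ('precip', 1) -> 'precip00.001'."""
    digits = str(step)
    # Trim the prefix if prefix + step number would exceed 11 characters.
    prefix = prefix[:11 - len(digits)]
    padded = prefix + digits.zfill(11 - len(prefix))
    return padded[:8] + "." + padded[8:]


def step_for_date(date_string, first_date_string, first_step=1):
    """Time step of a yyyymmdd date, counted in days from the first date."""
    fmt = "%Y%m%d"
    day = datetime.datetime.strptime(date_string, fmt).date()
    first = datetime.datetime.strptime(first_date_string, fmt).date()
    return (day - first).days + first_step


# 1 February 1981 is the 32nd day when counting from 1 January 1981.
print(stack_name("precip", step_for_date("19810201", "19810101")))  # precip00.032
```

With this numbering, precip.19810201 always becomes step 32, whether or not all maps for January are present.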
As you used the [batch-file] tag, I assume Batch is OK:
@echo off
setlocal enabledelayedexpansion
set /a counti=0
for /f "delims=" %%a in ('dir /b /on precip.*') do (
    set /a counti+=1
    set "counts=000000000!counti!"
    ECHO ren "%%a" "precip!counts:~-6,3!.!counts:~-3!"
)
Remove the ECHO once you have checked that the output is correct.
EDITED to match your requirement that precip00.999 is followed by precip01.000 ... up to precip07.300 (in your question it's precip000.001, in your comment it's precip00.001; I decided to use the first format, which can easily be changed to ECHO ren "%%a" "precip!counts:~-5,2!.!counts:~-3!" for the second format). Although it's not Batch anymore, I'll leave the answer; maybe you can at least use the logic.
If you are not familiar with Batch, the %variable:~-6,3% syntax is explained by set /?
I faced this issue a short while ago. Please note that I am new to both Python and PCRaster, so do not use my example without checking it.
import os
import fnmatch
from shutil import copyfile

TipeofFile = 'precip.????????'  # original file
Files = []
for iListFile in sorted(os.listdir('.')):
    if fnmatch.fnmatch(iListFile, TipeofFile):
        Files.append(iListFile)
digiafter = 3   # after the point: .001, .002, .003
digitTotal = 8  # total: precip00000.000 (5.3)
for j in range(0, len(Files)):
    num = str(j + 1)
    nameFile = Files[j]
    putZeros = digitTotal - len(num)
    for x in range(0, putZeros):
        num = "0" + num
    precip = num[0:digitTotal-digiafter] + '.' + num[digitTotal-digiafter:digitTotal]
    precip = str(precip)
    precip = 'precip' + precip
    copyfile(nameFile, precip)

BioPython AlignIO ValueError says strings must be same length?

Input fasta-format text file:
http://www.jcvi.org/cgi-bin/tigrfams/DownloadFile.cgi?file=/opt/www/www_tmp/tigrfams/fa_alignment_PF00205.txt
#!/usr/bin/python
from Bio import AlignIO
seq_file = open('/path/to/fa_alignment_PF00205.txt')
alignment = AlignIO.read(seq_file, "fasta")
Error:
ValueError: Sequences must all be the same length
The input sequences shouldn't have to be the same length since on ClustalOmega you can align sequences of differing lengths.
This also doesn't work...gets the same error:
alignment = AlignIO.parse(seq_file,"fasta")
for record in alignment:
    print(record.id)
Does anybody who is familiar with BioPython know how to get around this to align sequences from fasta files?
Pad the sequence that is too short and write the records to a temporary FASTA file. Then reading the alignment works as expected:
from Bio import AlignIO
from Bio import SeqIO
from Bio import Seq
import os

input_file = '/path/to/fa_alignment_PF00205.txt'
records = SeqIO.parse(input_file, 'fasta')
records = list(records)  # make a copy, otherwise our generator
                         # is exhausted after calculating maxlen
maxlen = max(len(record.seq) for record in records)
# pad sequences so that they all have the same length
for record in records:
    if len(record.seq) != maxlen:
        sequence = str(record.seq).ljust(maxlen, '.')
        record.seq = Seq.Seq(sequence)
assert all(len(record.seq) == maxlen for record in records)
# write to temporary file and do alignment
output_file = '{}_padded.fasta'.format(os.path.splitext(input_file)[0])
with open(output_file, 'w') as f:
    SeqIO.write(records, f, 'fasta')
alignment = AlignIO.read(output_file, "fasta")
print(alignment)
This outputs:
SingleLetterAlphabet() alignment with 104 rows and 275 columns
TKAAIELIADHQ.......LTVLADLLVHRLQ..AVKELEALLA...QAL SP|A2VGF0.1/208-339
LQELASVINQHE...KV..MLFCGHGCR...Y..AVEEVMALAK...EDL SP|A3D4X6.1/190-319
IKKIAQAIEKAK...KP..VICAGGGVINS.N..ASEELLTLSR...KEL SP|A3DID9.1/192-327
IDEAAEAINKAE...RP..VILAGGGVSIA.G..ANKELFEFAT...QLL SP|A3DIY4.1/192-327
IEKAIELINSSQ...RP..FICSGGGVISS.E..ASEELIQFAE...KIL SP|A4XHS0.1/191-326
IKRAVEAIENSQ...RP..VICSGGGVIAS.R..ASDELKILVE...SEI SP|A4XIL5.1/194-328
VRQAARIIMESE...RP..VIYAGGGVRIS.G..AAPELLELSE...RAL SP|A5D4V9.1/192-327
LQALAQRILRAQ...RP..VIITGDEIVKS.D..ALQAAADFAS...LQL SP|A5ECG1.1/192-328
VEKAVELLWSAR...RV..LVISGRGAR...G..AGPELIGLLD...RAM SP|A5EDH4.1/198-324
IQKAARLIETAE...KP..VIIAGHGVNIS.G..ANEELKTLAE...KSL SP|A5FR34.1/193-328
LDALARDLDSAA...RV..TIYAGIGAR...G..AAARVVQLAG...EAL SP|A5FTR0.1/189-317
VADVAALLRAAR...RP..VIVAGGGVIHSG...AEERLATFAA...DAL SP|A5G0X6.1/217-351
IAEAVSALKGAK...RP..IIYTGGGLINS.GPESAELIVQLAK...RAL SP|A5G2E1.1/199-336
LKKAAEIINRAK...RP..LIYAGGGITLA.G..ASAELRALAA...ALL SP|A5GC69.1/192-327
CRDIVGKLLQSH...RP..VVLGGTGVRLS.R..TEQRLLALVE...DVF SP|A5W0I1.1/200-336
LDQAALKLAAAE...RP..MIIAGGGA..L.H..AAEQLAQLSA...AGL SP|A5W220.1/196-326
LQRAADILNTGH...KV..AILVGAGAL...Q..ATEQVIAIAE...RAL SP|A5W364.1/198-328
IRKAAEMLLAAK...RP..VVYSGGGVILG.G..GSEALTEIAK...SEM SP|A5W954.1/196-331
...
LTELQERLANAQ...RP..VVILGGSRWSD.A..AVQQFTRFAE...... SP|Q220C3.1/190-328
Your problem is the last record of the FASTA file. Running tail -9 fa_alignment_PF00205.txt shows:
>SP|Q21VK8.1/229-357
LQAALAALAKAE...RP..LLVIGSQALVLSK..QAEHLAEAVARL.GIPV.YLSGMA..RGLLG.R..........DH.
...............PLQ..................MRHQRRQALRE..ADCVLLAG.VP...CDFRLD......YGKHV
RR..............S.AT.........L..IAA.N......................RSA.........KDARLNR..
.......K...PD.IAAIGDAG.......LFLQAL
>SP|Q220C3.1/190-328
LTELQERLANAQ...RP..VVILGGSRWSD.A..AVQQFTRFAEAF.SLPV.FCSFRR..QMLFS.A..........NH.
...............ACY...AG.DLGLG.A.....NQRLLARI.RQ..SDLILLLG.GR...MSEVPS......QGYEL
LGIPAPQQ...........D
The sequence with ID SP|Q220C3.1/190-328 has a different length from the other sequences.
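Before handing a file to AlignIO, it can help to list which records deviate from the common length. The following pure-Python sketch does that without Biopython. parse_fasta and length_outliers are hypothetical helpers, and the most frequent length is assumed to be the intended alignment width.

```python
from collections import Counter


def parse_fasta(text):
    """Map each record ID (header line without '>') to its joined sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is not None and line:
            records[current] += line
    return records


def length_outliers(records):
    """Return {id: length} for records whose length is not the most common one."""
    if not records:
        return {}
    counts = Counter(len(seq) for seq in records.values())
    expected = counts.most_common(1)[0][0]
    return {name: len(seq) for name, seq in records.items()
            if len(seq) != expected}


example = ">a\nAC.GT\nA\n>b\nACGTAA\n>c\nACGT\n"
print(length_outliers(parse_fasta(example)))  # {'c': 4}
```

An empty result means all records have the same length, which is the condition AlignIO checks when it reads an alignment.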
