How to convert multiple FASTA lines into a matrix in Python?

I have a file (txt or fasta) like this. Each sequence is located only in a single line.
>Line1
ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC
>Line2
ATTGCGCTANANAGCTANANCGATAGANCACGAAAGAGATAGACTATAGC
>Line3
ATCGCGCTANANAGCTANANGGCTAGANCNCGAAAGNGATAGACTATAGC
>Line4
ATTGCGCTANANAGCTANANGGATAGANCACGAGAGAGATAGACTATAGC
>Line5
ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC
I need to get a matrix in which each position corresponds to one letter (nucleotide) of the sequences. In this case, a 5x50 matrix.
I've been dealing with numpy methods. I hope someone could help me.

If you are working with DNA sequence data in python, I would recommend using the Biopython library. You can install it with pip install biopython.
Here is how you would achieve your desired result:
from Bio import SeqIO
import os
import numpy as np

pathToFile = os.path.join("C:\\", "Users", "Kevin", "Desktop", "test.fasta")  # Windows machine
allSeqs = []
for seq_record in SeqIO.parse(pathToFile, "fasta"):
    # Store each sequence as a list of single characters (one matrix row)
    allSeqs.append(list(str(seq_record.seq)))
seqMat = np.array(allSeqs)
Inside the for loop, each seq_record.seq is a Seq object, which gives you the flexibility to perform operations on it.
In [5]: seqMat.shape
Out[5]: (5L, 50L)
You can slice your seqMat array however you like.
In [6]: seqMat[0]
Out[6]: array(['A', 'T', 'C', 'G', 'C', 'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A',
'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A', 'G', 'C', 'T', 'A', 'G',
'A', 'N', 'C', 'A', 'C', 'G', 'A', 'T', 'A', 'G', 'A', 'G', 'A',
'G', 'A', 'G', 'A', 'C', 'T', 'A', 'T', 'A', 'G', 'C'],
dtype='|S1')
Highly recommend checking out the tutorial though!
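To illustrate the kind of slicing mentioned above without needing a file or Biopython, here is a minimal sketch that builds the same 5x50 character array directly from the five sequences in the question. The names `seq_mat`, `third_column` and `variable_cols` are illustrative, not from the original answer.

```python
import numpy as np

# The five example sequences from the question, hard-coded for illustration
seqs = [
    "ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC",
    "ATTGCGCTANANAGCTANANCGATAGANCACGAAAGAGATAGACTATAGC",
    "ATCGCGCTANANAGCTANANGGCTAGANCNCGAAAGNGATAGACTATAGC",
    "ATTGCGCTANANAGCTANANGGATAGANCACGAGAGAGATAGACTATAGC",
    "ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC",
]

# One row per sequence, one column per nucleotide position
seq_mat = np.array([list(s) for s in seqs])
print(seq_mat.shape)

# A column slice gives the nucleotide at one position across all sequences
third_column = seq_mat[:, 2]
print(third_column)

# Indices of columns in which at least one row differs from the first row
variable_cols = np.where((seq_mat != seq_mat[0]).any(axis=0))[0]
print(variable_cols)
```

Building the rows with `list(s)` only works cleanly when all sequences have the same length, as they do here; otherwise NumPy cannot form a rectangular array.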

I hope this short bit of code helps. You basically need to split the string into a character array. After that you just put everything into a matrix.
import numpy as np

Line1 = "ATGC"
Line2 = "GCTA"
# Both rows go into ONE nested list; a second positional argument would be read as the dtype
Matr1 = np.array([list(Line1), list(Line2)])
Matr1[0,0] will return the first element in your matrix.

One way to build the matrix is to read the file and convert its content into a list in which each element is the sequence from one line. You can then access the matrix as a 2D data structure.
Ex: [ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC]
filePath = "file path containing the sequence"
# List that stores the sequences as a matrix (empty lines and ">" header lines are skipped)
with open(filePath) as handle:
    listFasta = [line for line in handle.read().split("\n") if line and not line.startswith(">")]
for seq in listFasta:
    for charac in seq:
        print(charac)
Another way to access each element of your matrix:
for seq in range(len(listFasta)):
    for ch in range(len(listFasta[seq])):
        print(listFasta[seq][ch])
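The same idea can be wrapped in a small helper that works on text that is already in memory. This is only a sketch under the question's assumption that each sequence occupies a single line; the name `fasta_to_rows` and the shortened sample text are made up for illustration.

```python
def fasta_to_rows(text):
    """Return one list of characters per sequence, skipping '>' header lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # ignore blank lines and FASTA headers
        rows.append(list(line))
    return rows


# Shortened, made-up sample in the same layout as the question's file
example = ">Line1\nATCG\n>Line2\nATTG\n"
matrix = fasta_to_rows(example)
print(matrix[1][2])  # row 1, column 2
```

Because every row is a list of characters, `matrix[row][col]` addresses a single nucleotide, and the nested list can be handed to `numpy.array` if a real array is needed.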

Related

How to split a string into a list of predefined substrings of different lengths?

Given a collection of predefined strings of unequal length, input a string, and split the string into occurrences of elements in the collection, the output should be unique for every input, and it should prefer the longest possible chunks.
For example, it should split s, c, h into different chunks, unless they are adjacent.
If "sc" appear together, it should be grouped into 'sc' and not as 's', 'c', similarly if "sh" appears then it must be grouped into 'sh', if "ch" appears then it should be grouped into 'ch', and finally "sch" should be grouped into 'sch'.
I only know that string.split(delim) splits on a specified delimiter and that re.split('\w{n}', string) splits a string into chunks of equal length. Neither method gives the intended result. How can this be done?
Pseudo code:
def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string)
    return output
And example outputs:
phonemic_splitter('case') -> ['c', 'a', 's', 'e']
phonemic_splitter('ash') -> ['a', 'sh']
phonemic_splitter('change') -> ['ch', 'a', 'n', 'g', 'e']
phonemic_splitter('schane') -> ['sch', 'a', 'n', 'e']
Here is a possible solution:
def phonemic_splitter(s, phonemes):
    phonemes = sorted(phonemes, key=len, reverse=True)
    result = []
    while s:
        result.append(next(filter(s.startswith, phonemes)))
        s = s[len(result[-1]):]
    return result
This solution relies on the fact that phonemes contains a list of all the possible phonemes that can be found within the string s (otherwise, next could raise an exception).
One could also speed up this solution by implementing a binary search to be used in place of next.
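The caveat about next raising an exception can be made explicit. The following is a minimal sketch of the same greedy longest-first approach, assuming that input which matches no phoneme should raise a ValueError; that error-handling choice, the `PHONEMES` constant and the variable names are assumptions, not part of the original answer.

```python
PHONEMES = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']


def phonemic_splitter(s, phonemes=PHONEMES):
    # Try longer phonemes first so 'sch' is preferred over 's' + 'ch';
    # empty strings are dropped because they would never consume input
    ordered = sorted((p for p in phonemes if p), key=len, reverse=True)
    result = []
    while s:
        # next(..., None) returns None instead of raising StopIteration
        match = next((p for p in ordered if s.startswith(p)), None)
        if match is None:
            raise ValueError(f"no phoneme matches the start of {s!r}")
        result.append(match)
        s = s[len(match):]
    return result


for word in ['case', 'ash', 'change', 'schane']:
    print(repr(word), '->', phonemic_splitter(word))
```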
You could use a regex:
import re
cases=['case', 'ash', 'change', 'schane']
for e in cases:
    print(repr(e), '->', re.findall(r'sch|sh|ch|[a-z]', e))
Prints:
'case' -> ['c', 'a', 's', 'e']
'ash' -> ['a', 'sh']
'change' -> ['ch', 'a', 'n', 'g', 'e']
'schane' -> ['sch', 'a', 'n', 'e']
You could incorporate into your function this way:
import re

def do_something(s, splits):
    # Longest alternatives first so that 'sch' wins over 'sh' and 'ch';
    # re.escape guards against regex metacharacters in the phonemes
    pat = '|'.join(sorted(
        [re.escape(x) for x in splits if len(x) > 1],
        key=len, reverse=True)) + '|[a-z]'
    return re.findall(pat, s)

def phonemic_splitter(string):
    phonemes = ['a', 'sh', 's', 'g', 'n', 'c', 'e', 'ch', 'sch']
    output = do_something(string, phonemes)
    return output

How to iterate over position through Numpy- Python

I am wondering if there is a way to iterate over individual positions in a sequence list using NumPy. For example, if I had a list of sequences:
a = ['AGHT','THIS','OWKF']
The function should go through each individual character by position. So the first sequence, 'AGHT', would be broken down into 'A','G','H','T'. The ultimate goal is to create individual grids based on character abundance in each of these sequences. So far I have only been able to write a loop that goes through each character, but I need this in NumPy:
b = np.array(a)
for c in b:
    for d in c:
        print(d)
I would prefer this in NumPy, but if there are other ways I would like to know as well. Thanks!
list expands a string into a list:
In [406]: a = ['AGHT','THIS','OWKF']
In [407]: [list(item) for item in a]
Out[407]: [['A', 'G', 'H', 'T'], ['T', 'H', 'I', 'S'], ['O', 'W', 'K', 'F']]
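The nested list above can be passed to NumPy to get a 2D character grid, which makes per-position access a column slice. The sketch below also counts how often each character occurs at each position, which is one possible reading of the "character abundance" goal in the question; the helper name `column_counts` is hypothetical.

```python
import numpy as np

a = ['AGHT', 'THIS', 'OWKF']

# Rows are sequences, columns are positions within a sequence
grid = np.array([list(item) for item in a])
print(grid[:, 0])  # first character of every sequence


def column_counts(char_grid):
    """Return one {character: count} dict per column of a 2D character array."""
    result = []
    for pos in range(char_grid.shape[1]):
        letters, counts = np.unique(char_grid[:, pos], return_counts=True)
        result.append(dict(zip(letters.tolist(), counts.tolist())))
    return result


print(column_counts(grid))
```

This assumes all sequences have the same length; ragged input would need padding or a plain Python loop instead.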
You can use join() to join the list into a single string, then iterate over each character or print it like this:
>>> a = ['AGHT','THIS','OWKF']
>>> print(''.join(a))
AGHTTHISOWKF
Or to turn it into an array of individual characters:
>>> out = ''.join(a)
>>> np.array(list(out))
array(['A', 'G', 'H', 'T', 'T', 'H', 'I', 'S', 'O', 'W', 'K', 'F'],
      dtype='<U1')

How to convert a numpy.ndarray type into a list?

I want to read a MAT-file in Python and then export the data to a database. To do this, I need the data as a Python list. I wrote the code below:
import scipy.io as si
import csv
a = si.loadmat('matfilename')
b = a['variable']
list1=b.tolist()
The variable has 1 row and 15 columns. When I print list1, I get the output below. It is indeed a list, but a list that contains only one element, so list1[0] gives essentially the same result:
[[array(['A'],
dtype='<U13'), array(['B'],
dtype='<U14'), array(['C'],
dtype='<U6'), array(['D'],
dtype='<U4'), array(['E'],
dtype='<U10'), array(['F'],
dtype='<U13'), array(['G'],
dtype='<U11'), array(['H'],
dtype='<U9'), array(['I'],
dtype='<U16'), array(['J'],
dtype='<U18'), array(['K'],
dtype='<U16'), array(['L'],
dtype='<U16'), array(['M'],
dtype='<U16'), array(['N'],
dtype='<U14'), array(['O'],
dtype='<U13')]]
While the form that I expect is:
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']
Does anyone know what the problem is?
In my experience, that is simply how MATLAB files are structured when loaded: as nested arrays.
You can create the list yourself:
>>> [x[0][0] for x in list1[0]]
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']
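The flattening can also be done on the array itself, before calling tolist(). The sketch below does not read a real MAT-file: it builds a small stand-in object array with the same nesting as the printed output (a 1xN array whose cells are one-element string arrays), which is an assumption about what `a['variable']` looks like.

```python
import numpy as np

# Stand-in for a['variable']: a 1x3 object array whose cells each hold
# a one-element string array (assumed to mirror the loadmat result above)
b = np.empty((1, 3), dtype=object)
b[0, 0] = np.array(['A'])
b[0, 1] = np.array(['B'])
b[0, 2] = np.array(['C'])

# ravel() walks the cells in order; item() unwraps each one-element array
flat = [str(cell.item()) for cell in b.ravel()]
print(flat)
```

`item()` only works when each cell holds exactly one value, so a cell with several entries would need different handling.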

Sorting a list using an alphabet string

I'm trying to sort a list containing only lower case letters by using the string :
alphabet = "abcdefghijklmnopqrstuvwxyz".
that is without using sort, and with O(n) complexity only.
I got here:
def sort_char_list(lst):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    new_list = []
    length = len(lst)
    for i in range(length):
        new_list.insert(alphabet.index(lst[i]),lst[i])
        print (new_list)
    return new_list
for this input :
m = list("emabrgtjh")
I get this:
['e']
['e', 'm']
['a', 'e', 'm']
['a', 'b', 'e', 'm']
['a', 'b', 'e', 'm', 'r']
['a', 'b', 'e', 'm', 'r', 'g']
['a', 'b', 'e', 'm', 'r', 'g', 't']
['a', 'b', 'e', 'm', 'r', 'g', 't', 'j']
['a', 'b', 'e', 'm', 'r', 'g', 't', 'h', 'j']
['a', 'b', 'e', 'm', 'r', 'g', 't', 'h', 'j']
It looks like something goes wrong along the way, and I can't understand why. If anyone can enlighten me, that would be great.
You are looking for a bucket sort. Here:
def sort_char_list(lst):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # Here, create the 26 buckets
    new_list = [''] * len(alphabet)
    for letter in lst:
        # This is the bucket index
        # You could use `ord(letter) - ord('a')` in this specific case, but it is not mandatory
        index = alphabet.index(letter)
        new_list[index] += letter
    # Assemble the buckets
    return ''.join(new_list)
As for complexity, since alphabet is a pre-defined fixed-size string, searching for a letter in it requires at most 26 operations, which qualifies as O(1). The overall complexity is therefore O(n).
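As for why the original attempt fails: list.insert(i, x) places x before whatever currently sits at index i of the partially built list, and an index past the end simply appends, so the alphabet index is not a valid position until the list is full. A counting variant of the bucket idea avoids this. The sketch below uses the `ord(letter) - ord('a')` shortcut mentioned in the comment and returns a list rather than a string; both are choices made here for illustration.

```python
def sort_char_list(lst):
    # One counter per lowercase letter 'a'..'z'
    counts = [0] * 26
    for letter in lst:
        counts[ord(letter) - ord('a')] += 1
    # Emit each letter as many times as it was seen, in alphabet order
    return [chr(ord('a') + i) for i, n in enumerate(counts) for _ in range(n)]


m = list("emabrgtjh")
print(sort_char_list(m))
```

Each input letter is touched once and the final pass runs over 26 fixed buckets plus n output characters, so the work stays linear in the length of the list.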

Find specific character with python regex

I have a list of strings looking like this:
H PL->01 Tx=000/006 Ph=00/000 DGDD DDDR YDyD GRDD YGR Dets= 003,003,003,003,003,003,003,003,003,003,003,003, ports= 255,255,255,255,255,255,255,255,'
I want to extract the content that matches DGDD DDDR YDyD GRDD YGR (this changes, but it always consists of the letters D, G, R, Y, y, and its length may vary) and put it in a list without whitespace, like this:
['D', 'G', 'D', 'D', 'D', 'D', 'D', 'R', 'Y', 'D', 'y', 'D', 'G', 'R', 'D', 'D', 'Y', 'G', 'R']
If the criterion is groups of DGRYy that have at least three characters, then you can use a regex to that effect and "flatten" the result into a list afterwards, e.g.:
import re
from itertools import chain
# data is assumed to hold one input string like the one shown above
print(list(chain.from_iterable(re.findall('[DGRYy]{3,}', data))))
# ['D', 'G', 'D', 'D', 'D', 'D', 'D', 'R', 'Y', 'D', 'y', 'D', 'G', 'R', 'D', 'D', 'Y', 'G', 'R']
If it's always between two items, then it's possible to use the built-in string methods to extract it, e.g.:
print([ch for ch in data[data.index('Ph'):].partition('Dets=')[0].split(' ', 1)[1] if ch != ' '])
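For a version that runs on its own, the regex approach can be wrapped in a function and applied to the sample line from the question. This is only a sketch: the function name `extract_states` is made up, and the "at least three characters" rule is the same assumption the answer states.

```python
import re


def extract_states(line):
    """Return the D/G/R/Y/y characters from runs of at least three such letters."""
    groups = re.findall(r'[DGRYy]{3,}', line)
    # Joining the groups and calling list() splits them into single characters
    return list(''.join(groups))


# Sample line copied from the question
data = ("H PL->01 Tx=000/006 Ph=00/000 DGDD DDDR YDyD GRDD YGR "
        "Dets= 003,003,003,003,003,003,003,003,003,003,003,003, "
        "ports= 255,255,255,255,255,255,255,255,'")

print(extract_states(data))
```

If a group could ever be shorter than three letters, the length threshold would drop it, and the position-based approach between 'Ph=' and 'Dets=' would be the safer choice.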
