I was wondering what the best way is to split a string such as "abcdefghijklmnopqrstuvwxyz" into a list of overlapping groups of 2 (splitting the string into ["ab", "bc", "cd", ... "yz"]) in Python.
Or what about groups of 3, splitting the string into ["abc", "bcd", "cde", ... "xyz"]?
Thanks!
Here is one that works for any iterable (Python 2; in Python 3, use the built-in zip instead of izip):
>>> from itertools import tee, izip, islice
>>> chunksize = 2
>>> s = 'abcdefghijklmnopqrstuvwxyz'
>>> t = tee(s, chunksize)
>>> for i, j in enumerate(t):
...     next(islice(j, i, i), None)
...
>>> ["".join(k) for k in izip(*t)]
['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hi', 'ij', 'jk', 'kl', 'lm', 'mn', 'no', 'op', 'pq', 'qr', 'rs', 'st', 'tu', 'uv', 'vw', 'wx', 'xy', 'yz']
If s is always a str, this is more straightforward:
>>> [s[i: i + chunksize] for i in range(len(s) + 1 - chunksize)]
['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hi', 'ij', 'jk', 'kl', 'lm', 'mn', 'no', 'op', 'pq', 'qr', 'rs', 'st', 'tu', 'uv', 'vw', 'wx', 'xy', 'yz']
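A Python 3 sketch of the same sliding-window idea for arbitrary iterables, using collections.deque instead of tee. The name sliding_window is only an illustrative choice and does not come from the answer above:

```python
from collections import deque
from itertools import islice


def sliding_window(iterable, size):
    """Yield overlapping tuples of `size` consecutive items (hypothetical helper)."""
    it = iter(iterable)
    # Pre-fill the window with the first `size` items.
    window = deque(islice(it, size), maxlen=size)
    if len(window) == size:
        yield tuple(window)
    for item in it:
        # Appending to a full deque drops the oldest item automatically.
        window.append(item)
        yield tuple(window)


pairs = ["".join(w) for w in sliding_window("abcdef", 2)]
triples = ["".join(w) for w in sliding_window("abcdef", 3)]
```

If the input is shorter than the window, nothing is yielded, which mirrors what the slicing version returns for a too-short string.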
This is a one-liner:
def split_by_len(text, chunksize):
    return [text[i:(i+chunksize)] for i in range(len(text)-chunksize+1)]
def segment_string(s, segment_len):
    return [s[i:i+segment_len] for i in range(len(s) - (segment_len - 1))]
>>> for i in range(5):
...     print segment_string("abcdefghijklmnopqrstuvwxyz", i)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hi', 'ij', 'jk', 'kl', 'lm', 'mn', 'no', 'op', 'pq', 'qr', 'rs', 'st', 'tu', 'uv', 'vw', 'wx', 'xy', 'yz']
['abc', 'bcd', 'cde', 'def', 'efg', 'fgh', 'ghi', 'hij', 'ijk', 'jkl', 'klm', 'lmn', 'mno', 'nop', 'opq', 'pqr', 'qrs', 'rst', 'stu', 'tuv', 'uvw', 'vwx', 'wxy', 'xyz']
['abcd', 'bcde', 'cdef', 'defg', 'efgh', 'fghi', 'ghij', 'hijk', 'ijkl', 'jklm', 'klmn', 'lmno', 'mnop', 'nopq', 'opqr', 'pqrs', 'qrst', 'rstu', 'stuv', 'tuvw', 'uvwx', 'vwxy', 'wxyz']
>>> groups = zip(s, s[1:], s[2:])
>>> ["".join(g) for g in groups]
['abc', 'bcd', 'cde', 'def', 'efg', 'fgh', 'ghi', 'hij', 'ijk', 'jkl', 'klm', 'lmn', 'mno', 'nop', 'opq', 'pqr', 'qrs', 'rst', 'stu', 'tuv', 'uvw', 'vwx', 'wxy', 'xyz']
str1 = "ABCDEF"
I want a list of all substrings of length 3 in the above string, including overlapping ones.
For example:
list1 = ['ABC','BCD','CDE','DEF']
I tried the following but it misses the overlap:
n = 3
lst = [str1[i:i+n] for i in range(0, len(str1), n)]
x = "ABCDEF"
print ([x[i:i+3] for i in range(len(x)-2)])
Output:
['ABC', 'BCD', 'CDE', 'DEF']
More generally:
x = "ABCDEF"
n = 2
print ([x[i:i+n] for i in range(len(x)-n+1)])
Output:
['AB', 'BC', 'CD', 'DE', 'EF']
Even more generally:
x = "ABCDEF"
for n in range(len(x)+1):
    print ([x[i:i+n] for i in range(len(x)-n+1)])
Output:
['', '', '', '', '', '', '']
['A', 'B', 'C', 'D', 'E', 'F']
['AB', 'BC', 'CD', 'DE', 'EF']
['ABC', 'BCD', 'CDE', 'DEF']
['ABCD', 'BCDE', 'CDEF']
['ABCDE', 'BCDEF']
['ABCDEF']
I need some assistance writing code that will convert a given RNA nucleotide sequence into an amino acid sequence.
I've been given 2 dictionaries to use: one mapping RNA codons to the 3-letter codes of their amino acids, and one mapping the 3-letter codes to their corresponding 1-letter codes.
I need to write code that will take a given RNA sequence and output the single-letter codes. Below I've included the 2 provided dictionaries.
RNA_codon_table = {
# U
'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys', # UxU
'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys', # UxC
'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---', # UxA
'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp', # UxG
# C
'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg', # CxU
'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg', # CxC
'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg', # CxA
'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg', # CxG
# A
'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser', # AxU
'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser', # AxC
'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg', # AxA
'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg', # AxG
# G
'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly', # GxU
'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly', # GxC
'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly', # GxA
'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly' # GxG
}
singleletter = {'Cys': 'C', 'Asp': 'D', 'Ser': 'S', 'Gln': 'Q', 'Lys': 'K',
'Trp': 'W', 'Asn': 'N', 'Pro': 'P', 'Thr': 'T', 'Phe': 'F', 'Ala': 'A',
'Gly': 'G', 'Ile': 'I', 'Leu': 'L', 'His': 'H', 'Arg': 'R', 'Met': 'M',
'Val': 'V', 'Glu': 'E', 'Tyr': 'Y', '---': '*'}
You can do this with a list comprehension:
[singleletter[RNA_codon_table[s[i:i+3]]] for i in range(0, len(s),3)]
For example,
>>> s = 'UUUGAUAGC'
>>> [s[i:i+3] for i in range(0, len(s),3)]
['UUU', 'GAU', 'AGC']
>>> [RNA_codon_table[s[i:i+3]] for i in range(0, len(s),3)]
['Phe', 'Asp', 'Ser']
>>> [singleletter[RNA_codon_table[s[i:i+3]]] for i in range(0, len(s),3)]
['F', 'D', 'S']
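The comprehension above can be wrapped in a small function. This is only a sketch: the name translate is hypothetical, the two tables are trimmed to a handful of entries for illustration (the full dictionaries are in the question), and silently dropping a trailing incomplete codon is an assumption about the desired behaviour rather than anything the question specifies:

```python
# Trimmed-down tables for illustration only; the full dictionaries are in the question.
RNA_codon_table = {'UUU': 'Phe', 'GAU': 'Asp', 'AGC': 'Ser', 'AUG': 'Met', 'UAA': '---'}
singleletter = {'Phe': 'F', 'Asp': 'D', 'Ser': 'S', 'Met': 'M', '---': '*'}


def translate(rna, codon_table=RNA_codon_table, one_letter=singleletter):
    """Translate an RNA string to one-letter amino acid codes (hypothetical helper)."""
    # Assumption: ignore 1 or 2 leftover bases at the end instead of failing on them.
    usable = len(rna) - len(rna) % 3
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    # A codon missing from the table raises KeyError, as in the plain comprehension.
    return "".join(one_letter[codon_table[c]] for c in codons)


protein = translate('UUUGAUAGC')
```

Stop codons come out as '*', following the '---' entry in the single-letter dictionary.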
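Or, with BioPython (the Bio.Alphabet module used here was removed in Biopython 1.78; on newer releases, construct the Seq without the alphabet argument):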
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> s = Seq('UUUGAUAGC', IUPAC.unambiguous_rna)
>>> s.translate()
Seq('FDS', IUPACProtein())
I have the following letters:
Letters = ["a", "b", "c", "d", "e"]
What I would like is to write a generator function that will create strings that can be formed by taking a combination of any of the letters, preferably in some deterministic order like from smallest to biggest.
So for example if I were to run the generator 20 times I would get
a
b
c
d
e
aa
ab
ac
ad
ae
ba
bb
bc
bd
be
ca
cb
cc
cd
ce
da
How would I write this generator?
Generator function:
from itertools import count, product
def wordgen(letters):
    for n in count(1):
        yield from map(''.join, product(letters, repeat=n))
Usage:
for word in wordgen('abcde'):
    print(word)
Output:
a
b
c
d
e
aa
ab
ac
ad
ae
ba
bb
bc
bd
be
ca
...
A self-made alternative without using itertools:
def wordgen(letters):
    yield from letters
    for word in wordgen(letters):
        for letter in letters:
            yield word + letter
Golf-version (admittedly starts with the empty string):
def w(s):yield'';yield from(w+c for w in w(s)for c in s)
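Because wordgen never terminates, printing it in a plain for loop runs forever. One way to take only the first 20 words, as the question asks, is itertools.islice. The generator is repeated here so the snippet runs on its own, and first_20 is just an illustrative name:

```python
from itertools import count, islice, product


def wordgen(letters):
    # Same generator as in the answer above: all words of length 1, then 2, ...
    for n in count(1):
        yield from map(''.join, product(letters, repeat=n))


# islice stops after 20 items without exhausting the infinite generator.
first_20 = list(islice(wordgen('abcde'), 20))
```

Calling next() on a single generator object repeatedly works as well, if the words are needed one at a time.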
Use the combinations functions from the itertools library. There are both combinations with replacement and without replacement:
import itertools
for item in itertools.combinations(Letters, 2):
    print("".join(item))
https://docs.python.org/3.4/library/itertools.html
Use itertools.product():
from itertools import product, imap
letters = ["a", "b", "c", "d", "e"]
letters += imap(''.join, product(letters, repeat=2))
print letters
['a', 'b', 'c', 'd', 'e', 'aa', 'ab', 'ac', 'ad', 'ae', 'ba', 'bb', 'bc', 'bd', 'be', 'ca', 'cb', 'cc', 'cd', 'ce', 'da', 'db', 'dc', 'dd', 'de', 'ea', 'eb', 'ec', 'ed', 'ee']
I use a recursive generator function (without itertools)
Letters = ["a", "b", "c", "d", "e"]
def my_generator(list, first=""):
    for letter in list:
        yield first + letter
    my_generators = []
    for letter in list:
        my_generators.append(my_generator(list, first + letter))
    i = 0
    while True:
        for j in xrange(len(list)**(i/len(list)+1)):
            yield next(my_generators[i%len(list)])
        i+=1
gen = my_generator(Letters)
[next(gen) for c in xrange(160)]
you get
['a', 'b', 'c', 'd', 'e', 'aa', 'ab', 'ac', 'ad', 'ae', 'ba', 'bb',
'bc', 'bd', 'be', 'ca', 'cb', 'cc', 'cd', 'ce', 'da', 'db', 'dc',
'dd', 'de', 'ea', 'eb', 'ec', 'ed', 'ee', 'aaa', 'aab', 'aac', 'aad',
'aae', 'aba', 'abb', 'abc', 'abd', 'abe', 'aca', 'acb', 'acc', 'acd',
'ace', 'ada', 'adb', 'adc', 'add', 'ade', 'aea', 'aeb', 'aec', 'aed',
'aee', 'baa', 'bab', 'bac', 'bad', 'bae', 'bba', 'bbb', 'bbc', 'bbd',
'bbe', 'bca', 'bcb', 'bcc', 'bcd', 'bce', 'bda', 'bdb', 'bdc', 'bdd',
'bde', 'bea', 'beb', 'bec', 'bed', 'bee', 'caa', 'cab', 'cac', 'cad',
'cae', 'cba', 'cbb', 'cbc', 'cbd', 'cbe', 'cca', 'ccb', 'ccc', 'ccd',
'cce', 'cda', 'cdb', 'cdc', 'cdd', 'cde', 'cea', 'ceb', 'cec', 'ced',
'cee', 'daa', 'dab', 'dac', 'dad', 'dae', 'dba', 'dbb', 'dbc', 'dbd',
'dbe', 'dca', 'dcb', 'dcc', 'dcd', 'dce', 'dda', 'ddb', 'ddc', 'ddd',
'dde', 'dea', 'deb', 'dec', 'ded', 'dee', 'eaa', 'eab', 'eac', 'ead',
'eae', 'eba', 'ebb', 'ebc', 'ebd', 'ebe', 'eca', 'ecb', 'ecc', 'ecd',
'ece', 'eda', 'edb', 'edc', 'edd', 'ede', 'eea', 'eeb', 'eec', 'eed',
'eee', 'aaaa', 'aaab', 'aaac', 'aaad', 'aaae']
I am trying to do the following. The outer product of an array [a,b; c,d] with itself can be described as a 4x4 array of 'strings' of length 2. So in the upper left corner of the 4x4 matrix, the values are aa, ab, ac, ad. What's the best way to generate these strings in numpy/python or matlab?
This is an example for just one outer product. The goal is to handle k successive outer products, that is the 4x4 matrix can be multiplied again by [a,b; c,d] and so on.
You can obtain @Jaime's result in a much simpler way using np.char.array():
import numpy as np
a = np.char.array(list('abcd'))
print(a[:,None]+a)
which gives:
chararray([['aa', 'ab', 'ac', 'ad'],
['ba', 'bb', 'bc', 'bd'],
['ca', 'cb', 'cc', 'cd'],
['da', 'db', 'dc', 'dd']],
dtype='|S2')
Using a funky mix of itertools and numpy you could do:
>>> import numpy as np
>>> from itertools import product
>>> s = 'abcd' # s = ['a', 'b', 'c', 'd'] works the same
>>> np.fromiter((a+b for a, b in product(s, s)), dtype='S2',
...             count=len(s)*len(s)).reshape(len(s), len(s))
array([['aa', 'ab', 'ac', 'ad'],
['ba', 'bb', 'bc', 'bd'],
['ca', 'cb', 'cc', 'cd'],
['da', 'db', 'dc', 'dd']],
dtype='|S2')
You can also avoid numpy by getting a little creative with itertools:
>>> from itertools import product, islice
>>> it = (a+b for a, b in product(s, s))
>>> [list(islice(it, len(s))) for j in xrange(len(s))]
[['aa', 'ab', 'ac', 'ad'],
['ba', 'bb', 'bc', 'bd'],
['ca', 'cb', 'cc', 'cd'],
['da', 'db', 'dc', 'dd']]
You could use list comprehensions in Python:
array = [['a', 'b'], ['c', 'd']]
flatarray = [ x for row in array for x in row]
outerproduct = [[y+x for x in flatarray] for y in flatarray]
Output: [['aa', 'ab', 'ac', 'ad'], ['ba', 'bb', 'bc', 'bd'], ['ca', 'cb', 'cc', 'cd'], ['da', 'db', 'dc', 'dd']]
To continue the discussion after Jose Varz's answer:
def foo(A,B):
    flatA = [x for row in A for x in row]
    flatB = [x for row in B for x in row]
    outer = [[y+x for x in flatA] for y in flatB]
    return outer
In [265]: foo(A,A)
Out[265]:
[['aa', 'ab', 'ac', 'ad'],
['ba', 'bb', 'bc', 'bd'],
['ca', 'cb', 'cc', 'cd'],
['da', 'db', 'dc', 'dd']]
In [268]: A3=np.array(foo(foo(A,A),A))
In [269]: A3
Out[269]:
array([['aaa', 'aab', 'aac', 'aad', 'aba', 'abb', 'abc', 'abd', 'aca',
'acb', 'acc', 'acd', 'ada', 'adb', 'adc', 'add'],
['baa', 'bab', 'bac', 'bad', 'bba', 'bbb', 'bbc', 'bbd', 'bca',
'bcb', 'bcc', 'bcd', 'bda', 'bdb', 'bdc', 'bdd'],
['caa', 'cab', 'cac', 'cad', 'cba', 'cbb', 'cbc', 'cbd', 'cca',
'ccb', 'ccc', 'ccd', 'cda', 'cdb', 'cdc', 'cdd'],
['daa', 'dab', 'dac', 'dad', 'dba', 'dbb', 'dbc', 'dbd', 'dca',
'dcb', 'dcc', 'dcd', 'dda', 'ddb', 'ddc', 'ddd']],
dtype='|S3')
In [270]: A3.reshape(4,4,4)
Out[270]:
array([[['aaa', 'aab', 'aac', 'aad'],
['aba', 'abb', 'abc', 'abd'],
['aca', 'acb', 'acc', 'acd'],
['ada', 'adb', 'adc', 'add']],
[['baa', 'bab', 'bac', 'bad'],
['bba', 'bbb', 'bbc', 'bbd'],
['bca', 'bcb', 'bcc', 'bcd'],
['bda', 'bdb', 'bdc', 'bdd']],
[['caa', 'cab', 'cac', 'cad'],
['cba', 'cbb', 'cbc', 'cbd'],
['cca', 'ccb', 'ccc', 'ccd'],
['cda', 'cdb', 'cdc', 'cdd']],
[['daa', 'dab', 'dac', 'dad'],
['dba', 'dbb', 'dbc', 'dbd'],
['dca', 'dcb', 'dcc', 'dcd'],
['dda', 'ddb', 'ddc', 'ddd']]],
dtype='|S3')
With this definition, np.array(foo(A,foo(A,A))).reshape(4,4,4) produces the same array.
In [285]: A3.reshape(8,8)
Out[285]:
array([['aaa', 'aab', 'aac', 'aad', 'aba', 'abb', 'abc', 'abd'],
['aca', 'acb', 'acc', 'acd', 'ada', 'adb', 'adc', 'add'],
['baa', 'bab', 'bac', 'bad', 'bba', 'bbb', 'bbc', 'bbd'],
['bca', 'bcb', 'bcc', 'bcd', 'bda', 'bdb', 'bdc', 'bdd'],
['caa', 'cab', 'cac', 'cad', 'cba', 'cbb', 'cbc', 'cbd'],
['cca', 'ccb', 'ccc', 'ccd', 'cda', 'cdb', 'cdc', 'cdd'],
['daa', 'dab', 'dac', 'dad', 'dba', 'dbb', 'dbc', 'dbd'],
['dca', 'dcb', 'dcc', 'dcd', 'dda', 'ddb', 'ddc', 'ddd']],
dtype='|S3')
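For the k successive products the question asks about, the pairwise function can be folded over k copies of the input with functools.reduce. This is a pure-Python sketch under my own naming (outer_strings and repeated_outer are hypothetical), and it puts the left operand on the rows, so k = 3 gives 16 rows of 4 rather than the 4 rows of 16 shown above:

```python
from functools import reduce


def outer_strings(rows_a, rows_b):
    """String 'outer product' of two nested lists (hypothetical helper)."""
    flat_a = [x for row in rows_a for x in row]
    flat_b = [x for row in rows_b for x in row]
    # One output row per element of the left operand.
    return [[a + b for b in flat_b] for a in flat_a]


def repeated_outer(rows, k):
    """Apply outer_strings k - 1 times, giving strings of length k (k >= 1)."""
    return reduce(outer_strings, [rows] * k)


A = [['a', 'b'], ['c', 'd']]
square = repeated_outer(A, 2)
cube = repeated_outer(A, 3)
```

The flat order of the strings is the same lexicographic order as before, so the result can be reshaped afterwards in whatever layout is needed.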
Could it be that you want the Kronecker product of two char.arrays?
A quick adaptation of np.kron (numpy/lib/shape_base.py):
import numpy as np
def outer(a,b):
    # custom 'outer' for this issue
    # a,b must be np.char.array for '+' to be defined
    return a.ravel()[:, np.newaxis]+b.ravel()[np.newaxis,:]
def kron(a,b):
    # assume a,b are 2d char array
    # functionally same as np.kron, but using custom outer()
    result = outer(a, b).reshape(a.shape+b.shape)
    result = np.hstack(np.hstack(result))
    result = np.char.array(result)
    return result
A = np.char.array(list('abcd')).reshape(2,2)
produces:
A =>
[['a' 'b']
['c' 'd']]
outer(A,A) =>
[['aa' 'ab' 'ac' 'ad']
['ba' 'bb' 'bc' 'bd']
['ca' 'cb' 'cc' 'cd']
['da' 'db' 'dc' 'dd']]
kron(A,A) =>
[['aa' 'ab' 'ba' 'bb']
['ac' 'ad' 'bc' 'bd']
['ca' 'cb' 'da' 'db']
['cc' 'cd' 'dc' 'dd']]
kron rearranges the outer elements by reshaping it to (2,2,2,2), and then concatenating twice on axis=1.
kron(kron(A,A),A) =>
[['aaa' 'aab' 'aba' 'abb' 'baa' 'bab' 'bba' 'bbb']
['aac' 'aad' 'abc' 'abd' 'bac' 'bad' 'bbc' 'bbd']
['aca' 'acb' 'ada' 'adb' 'bca' 'bcb' 'bda' 'bdb']
['acc' 'acd' 'adc' 'add' 'bcc' 'bcd' 'bdc' 'bdd']
['caa' 'cab' 'cba' 'cbb' 'daa' 'dab' 'dba' 'dbb']
['cac' 'cad' 'cbc' 'cbd' 'dac' 'dad' 'dbc' 'dbd']
['cca' 'ccb' 'cda' 'cdb' 'dca' 'dcb' 'dda' 'ddb']
['ccc' 'ccd' 'cdc' 'cdd' 'dcc' 'dcd' 'ddc' 'ddd']]
kron(kron(kron(A,A),A),A) =>
# (16,16)
[['aaaa' 'aaab' 'aaba' 'aabb'...]
['aaac' 'aaad' 'aabc' 'aabd'...]
['aaca' 'aacb' 'aada' 'aadb'...]
['aacc' 'aacd' 'aadc' 'aadd'...]
...]
I don't know of a better way to word what I'm looking for, so please bear with me.
Let's say that I have a list of 17 elements. For the sake of brevity we'll represent this list as ABCDEFGHIJKLMNOPQ. If I wanted to divide this into 7 sufficiently "even" sub-lists, it might look like this:
ABC DE FGH IJ KL MNO PQ
Here, the lengths of each sub-list are 3, 2, 3, 2, 2, 3, 2. The maximum length is only one more than the minimum length: ABC DE FGH I JKL MN OPQ has seven sub-lists as well, but the range of lengths is two here.
Furthermore, examine how many 2's separate each pair of 3's: this follows the same rule of RANGE ≤ 1. The range of lengths in ABC DEF GH IJ KLM NO PQ is 1 as well, but they are imbalanced: 3, 3, 2, 2, 3, 2, 2. Ideally, if one were to keep reducing the sub-list in such a fashion, the numbers would never deviate from one another by more than one.
Of course, there is more than one way to "evenly" divide a list into sub-lists in this fashion. I'm not looking for an exhaustive set of solutions - if I can get one solution in Python for a list of any length and any number of sub-lists, that's good enough for me. The problem is that I don't even know where to begin when solving such a problem. Does anyone know what I'm looking for?
>>> s='ABCDEFGHIJKLMNOPQ'
>>> parts=7
>>> [s[i*len(s)//parts:(i+1)*len(s)//parts] for i in range(parts)]
['AB', 'CD', 'EFG', 'HI', 'JKL', 'MN', 'OPQ']
>>> import string
>>> for j in range(26):
...     print [string.uppercase[i*j//parts:(i+1)*j//parts] for i in range(parts)]
...
['', '', '', '', '', '', '']
['', '', '', '', '', '', 'A']
['', '', '', 'A', '', '', 'B']
['', '', 'A', '', 'B', '', 'C']
['', 'A', '', 'B', '', 'C', 'D']
['', 'A', 'B', '', 'C', 'D', 'E']
['', 'A', 'B', 'C', 'D', 'E', 'F']
['A', 'B', 'C', 'D', 'E', 'F', 'G']
['A', 'B', 'C', 'D', 'E', 'F', 'GH']
['A', 'B', 'C', 'DE', 'F', 'G', 'HI']
['A', 'B', 'CD', 'E', 'FG', 'H', 'IJ']
['A', 'BC', 'D', 'EF', 'G', 'HI', 'JK']
['A', 'BC', 'DE', 'F', 'GH', 'IJ', 'KL']
['A', 'BC', 'DE', 'FG', 'HI', 'JK', 'LM']
['AB', 'CD', 'EF', 'GH', 'IJ', 'KL', 'MN']
['AB', 'CD', 'EF', 'GH', 'IJ', 'KL', 'MNO']
['AB', 'CD', 'EF', 'GHI', 'JK', 'LM', 'NOP']
['AB', 'CD', 'EFG', 'HI', 'JKL', 'MN', 'OPQ']
['AB', 'CDE', 'FG', 'HIJ', 'KL', 'MNO', 'PQR']
['AB', 'CDE', 'FGH', 'IJ', 'KLM', 'NOP', 'QRS']
['AB', 'CDE', 'FGH', 'IJK', 'LMN', 'OPQ', 'RST']
['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']
['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STUV']
['ABC', 'DEF', 'GHI', 'JKLM', 'NOP', 'QRS', 'TUVW']
['ABC', 'DEF', 'GHIJ', 'KLM', 'NOPQ', 'RST', 'UVWX']
['ABC', 'DEFG', 'HIJ', 'KLMN', 'OPQ', 'RSTU', 'VWXY']
If you have a list of length N, and you want some number of sub-lists S, it seems to me that you should start with a division with remainder. For N == 17 and S == 7, you have 17 // 7 == 2 and 17 % 7 == 3. So you can start with 7 length values of 2, but know that you need to increment 3 of the length values by 1 to handle the remainder. Since your list of length values is length 7, and you have 3 values to increment, you could compute X = 7 / 3 and use that as a stride: increment the 0th item, then the int(X) item, the int(2*X) item, and so on.
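The division-with-remainder idea described above might be sketched as follows. The function names are hypothetical, and the stride is computed with integer arithmetic (k * parts // extra) instead of int(k * X), which picks the same positions while avoiding floating-point rounding:

```python
def even_lengths(n, parts):
    """Lengths of `parts` sub-lists covering n items (hypothetical helper)."""
    base, extra = divmod(n, parts)
    lengths = [base] * parts
    # Spread the `extra` longer sub-lists out using a stride of parts / extra.
    for k in range(extra):
        lengths[k * parts // extra] += 1
    return lengths


def split_evenly(seq, parts):
    """Cut seq into consecutive slices with the lengths computed above."""
    out, start = [], 0
    for length in even_lengths(len(seq), parts):
        out.append(seq[start:start + length])
        start += length
    return out


lengths = even_lengths(17, 7)
pieces = split_evenly('ABCDEFGHIJKLMNOPQ', 7)
```

For 17 items in 7 parts this gives lengths 3, 2, 3, 2, 3, 2, 2, so the longer sub-lists land at positions 0, 2 and 4.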
If that doesn't work for you, I suggest you get a book called The Algorithm Design Manual by Skiena, and look through the set and tree algorithms.
http://www.algorist.com/
See the "grouper" example at http://docs.python.org/library/itertools.html