I have a text file containing entries similar to the following example:
# 8 rows of header
---------------------------------------------
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
etc...
What I would like to achieve is:
Convert the As into 1s, the Bs into 0s in order to have a binary number
For the example above this would be 110100 (i.e. AABABB)
Convert this binary number into a decimal number
For the example above this would then be 52
Map this decimal number to a text string
(i.e. 52 = "Case 1" or 53 = "Case 2" etc.) and
Print this on the stdout
I have a little bit of Python experience but the problem above is way beyond my capabilities. Therefore any help from the community would be appreciated.
Many thanks in advance,
Hib
A few pointers (assuming Python 2):
Translating a string:
>>> import string
>>> table = string.maketrans("AB","10")
>>> translated = "AABABB".translate(table)
>>> translated
'110100'
Converting to base 10:
>>> int(translated, 2)
52
No idea how you would map that to those arbitrary strings - more information needed.
Printing to stdout - really? Which part are you having trouble with?
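If you are on Python 3 rather than Python 2, string.maketrans no longer exists; the table is built with the static method str.maketrans instead. A minimal sketch of the same two steps (the variable names are only for illustration):

```python
# Python 3: str.maketrans replaces the Python 2 string.maketrans function.
table = str.maketrans("AB", "10")       # maps "A" -> "1" and "B" -> "0"
translated = "AABABB".translate(table)  # "110100"
value = int(translated, 2)              # parse the digits as a base-2 number
print(translated, value)                # prints: 110100 52
```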
Something like this should work (not tested):
from itertools import islice
binary_map = dict(zip("AB", "10")) # Equivalent to {"A": "1", "B": "0"}
string_map = {52: "Case 1", 53: "Case 2"}
with open("my_text_file") as f:
    binary_str = "".join(binary_map[x.split()[2]] for x in islice(f, 9, None))
binary_value = int(binary_str, 2)
print(string_map[binary_value])
I'll break down the indented code line for you and explain it.
The join method of an empty string will concatenate the strings given in the argument, so "".join(["A", "B", "C"]) is equal to "ABC".
We pass this method a so-called generator expression, X for Y in Z. It has the same syntax as a list comprehension, except the square brackets are omitted.
The islice function returns an iterator that silently skips the first 9 lines of the file object f, so it yields lines starting with the 10th.
The split method of str with no arguments will split on any sequence of whitespace characters (space, tab ("\t"), linefeed ("\n") and carriage return ("\r")) and return a list. So for example, " a \t b\n\t c\n".split() is equal to ['a', 'b', 'c']. We're interested in the third column, x.split()[2], which is either "A" or "B".
Looking up this value in the binary_map dictionary will give us either "1" or "0" instead.
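Putting the pieces described above together, here is a rough sketch that works on any iterable of lines (an open file or a list). The function name classify, the header_rows parameter and the fallback strings are assumptions made for illustration; the lookup table only contains the two example values from the question.

```python
from itertools import islice

# Hypothetical lookup table; the question only gives 52 and 53 as examples.
STRING_MAP = {52: "Case 1", 53: "Case 2"}
BINARY_MAP = {"A": "1", "B": "0"}


def classify(lines, header_rows=9):
    """Return the label for the A/B column found after the header lines."""
    bits = []
    for line in islice(lines, header_rows, None):
        fields = line.split()
        # Skip blank or malformed lines instead of raising an error.
        if len(fields) >= 3 and fields[2] in BINARY_MAP:
            bits.append(BINARY_MAP[fields[2]])
    if not bits:
        return "no data rows"
    value = int("".join(bits), 2)
    return STRING_MAP.get(value, "unknown (%d)" % value)


sample = ["header\n"] * 9 + [
    "123 ABC12345 A some more variable length text\n",
    "456 DEF12345 A some more variable length text\n",
    "789 GHI12345 B some more variable length text\n",
    "987 JKL12345 A some more variable length text\n",
    "654 MNO12345 B some more variable length text\n",
    "321 PQR12345 B some more variable length text\n",
]
print(classify(sample))  # prints: Case 1
```

Using dict.get with a fallback avoids a KeyError when the decimal value has no entry in the table.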
a.txt:
# 8 rows of header
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
you can try this:
>>> int(''.join([line.split(' ')[2] for line in open('a.txt', 'r').readlines()[8:]]).replace('A', '1').replace('B', '0'), 2)
52
As for mapping the int to a string, not sure what you mean.
>>> value = {int(''.join([line.split(' ')[2] for line in open('a.txt', 'r').readlines()[8:]]).replace('A', '1').replace('B', '0'), 2): 'case 52'}
>>> value[52]
'case 52'
>>>
I used re module in order to check the format of the lines to be accepted:
>>> import re
>>> def map_file_to_string(string):
...     values = []
...     for line in string.split('\n'):
...         if re.match(r'\d{3} \w{3}\d{5} [AB] .*', line):
...             values.append(1 if line[13] == 'A' else 0)
...     return dict_map[int(''.join(map(str, values)), 2)]
...
>>> dict_map = {52: 'Case 1', 53: 'Case 2'}
>>> s1 = """# 8 rows of header
---------------------------------------------
123 ABC12345 A some more variable length text
456 DEF12345 A some more variable length text
789 GHI12345 B some more variable length text
987 JKL12345 A some more variable length text
654 MNO12345 B some more variable length text
321 PQR12345 B some more variable length text
etc.."""
>>> map_file_to_string(s1)
'Case 1'
>>>
I got the information of a txt file and store it as lines
print(lines)
['>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79\n', 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC\n', 'AGGACAGGCCGCTAAAGTG\n', '>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80\n', 'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG\n', 'GCCTGGTAACACGTGCCAGC\n']
If you execute the code
for i in lines:
print(i)
You get:
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTG
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGC
I want to store the sequences that are in caps TTGGTTTCGTGGTTT... as independent elements in an object so you can operate with them, so you would be able to do something like:
seq[1]
>>> TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTG
gattaca = [x.strip() for x in lines if x.isupper()]
>>> gattaca
['TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC',
'AGGACAGGCCGCTAAAGTG',
'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG',
'GCCTGGTAACACGTGCCAGC']
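Note that the list comprehension above returns every sequence line as its own element, whereas the question asks for one joined string per record. A minimal sketch of grouping the lines by their '>' header follows; the function name read_sequences is made up for this example, and it assumes every record starts with a header line.

```python
def read_sequences(lines):
    """Join the sequence lines that follow each '>' header into one string."""
    seqs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue             # ignore blank lines
        if line.startswith(">"):
            seqs.append("")      # a header starts a new record
        elif seqs:
            seqs[-1] += line     # continuation of the current record
    return seqs


example = [">rec1 length: 6\n", "TTGG\n", "AC\n", ">rec2 length: 4\n", "CTAA\n"]
print(read_sequences(example))  # prints: ['TTGGAC', 'CTAA']
```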
You can do this:
lines = list(map(str.strip, (filter(str.isupper, lines))))
With isupper() you can check whether a string in the list is upper case: it returns True when all cased characters in the string are upper case.
seqs = []
for i in lines:
    if i.isupper():
        seqs.append(i.strip())  # store the string
To check whether a string is all caps I would use myString == myString.upper().
To get all caps elements you could use a list comprehension like so:
result = [s for s in lines if s == s.upper()]
This would still allow special characters in your string.
If you only want uppercase letters, then also use isalpha() (on the stripped string, since a trailing newline makes isalpha() return False).
result = [s for s in lines if s == s.upper() and s.strip().isalpha()]
I would use a regex:
import re
seq={}
pattern=r'^(>.*$)\n([ACGTU\n]*?)(?=^>|\Z)'
for i,m in enumerate(re.finditer(pattern, ''.join(lines), flags=re.M)):
    seq[i]=m.group(2).replace('\n','')
Then each FASTA seq is mapped to an integer:
>>> seq
{0: 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTG', 1: 'CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCGGCCTGGTAACACGTGCCAGC'}
I've been searching for a simpler way to do this, but i'm not sure what search parameters to use. I have a floating point number, that i would like to round, convert to a string, then specify a custom format on the string. I've read through the .format docs, but can't see if it's possible to do this using normal string formatting.
The output i want is just a normal string, with spaces every three chars, except for the last ones, which should have a space four chars before the end.
For example, i made this convoluted function that does what i want in an inefficient way:
def my_formatter(value):
    final = []
    # round float and convert to list of strings of individual chars
    c = [i for i in '{:.0f}'.format(value)]
    if len(c) > 3:
        final.append(''.join(c[-4:]))
        c = c[:-4]
    else:
        return ''.join(c)
    for i in range(0, len(c) // 3 + 1, 1):
        if len(c) > 2:
            final.insert(0, ''.join(c[-3:]))
            c = c[:-3]
        elif len(c) > 0:
            final.insert(0, ''.join(c))
    return(' '.join(final))
e.g.
>>> my_formatter(123456789.12)
'12 345 6789'
>>> my_formatter(12345678912.34)
'1 234 567 8912'
Would really appreciate guidance on doing this in a simpler / more efficient way.
This takes a slightly different angle and uses a third-party function, partition_all from toolz. In short, I use it to split the string into groups of 3, plus a final group if fewer than 3 chars are left. You may prefer this as there are no for loops or conditionals, but the differences are basically cosmetic.
from toolz.itertoolz import partition_all
def simpleformat(x):
    x = str(round(x))
    a, b = x[:-4], x[-4:]
    strings = [''.join(x[::-1]) for x in reversed(list(partition_all(3, a[::-1])))]
    return ' '.join(strings + [b])
Try this:
def my_formatter(x):
    # round it as text
    txt = "{:.0f}".format(x)
    # find split indices
    splits = [None] + list(range(-4, -len(txt), -3)) + [None]
    # slice and rejoin
    return " ".join(
        reversed([txt[i:j] for i, j in zip(splits[1:], splits[:-1])]))
Then
>>> my_formatter(123456789.1)
'12 345 6789'
>>> my_formatter(1123456789.1)
'112 345 6789'
>>> my_formatter(11123456789.1)
'1 112 345 6789'
Here is a pretty simple solution using a loop over the elements in the reverse order such that counting the indices is easier:
num = 12345678912.34
temp = []
for ix, c in enumerate(reversed(str(round(num)))):
    if ix % 3 == 0 and ix != 0:
        temp.extend([c, ' '])
    else:
        temp.extend(c)
''.join(list(reversed(temp)))
Output:
'1 234 567 8912'
Using list comprehensions we can do this in a single very confusing line as
num = 12345678912.34
''.join(list(reversed(list(''.join([c+' ' if(ix%3 == 0 and ix!=0) else c for ix, c in enumerate(reversed(str(round(num))))])))))
'1 234 567 8912'
Another approach, provided the locale you need is available on your system, is to use locale together with format.
import locale
for v in ('fr_FR.UTF-8', 'en_GB.UTF-8'):
    locale.setlocale(locale.LC_NUMERIC, v)
    print(v, '>> {:n}'.format(111222333999))
May as well share another slightly different variant, but still can't shake the feeling that there's some sublime way that we just can't see. Haven't marked any answers as correct yet, because i'm convinced python can do this in a simpler way, somehow. What's also driving me crazy is that if i remember correctly VB's format command can handle this (with a pattern like "### ####0"). Maybe it's just a case of not understanding how to use Python's .format correctly.
The below accepts a float or decimal and a list indicating split positions. If there are still digits in the string after consuming the last split position, it re-applies that until it reaches the start of the string.
def format_number(num, pat, sep=' '):
    fmt = []
    strn = "{:.0f}".format(num)
    while strn:
        p = pat.pop() if pat else p
        fmt.append(strn[-p:])
        strn = strn[:-p] if len(strn) > p else ''
    return sep.join(fmt[::-1])
>>> format_number(123456789, [3, 4])
'12 345 6789'
>>> format_number(1234567890, [3])
'1 234 567 890'
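One more possibility, offered as a sketch rather than a definitive answer: the format mini-language can already group digits in threes with the "," option, so only the last four digits need special handling. The function name group_digits is made up for this example, and it assumes non-negative values.

```python
def group_digits(value, sep=" "):
    """Round value, keep the last four digits together, group the rest by three."""
    txt = "{:.0f}".format(value)
    if len(txt) <= 4:
        return txt
    head, tail = txt[:-4], txt[-4:]
    # "{:,}" inserts a comma every three digits; swap it for the separator.
    grouped = "{:,}".format(int(head)).replace(",", sep)
    return grouped + sep + tail


print(group_digits(123456789.12))    # prints: 12 345 6789
print(group_digits(12345678912.34))  # prints: 1 234 567 8912
```

Unlike format_number above, this does not take an arbitrary pattern; it only covers the "threes, then a final four" layout from the question.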
The following regex does exactly what I want it to do, except that it also outputs the index as a digit (I think it's the index). This messes up my output. How can I tell it not to pick up the index?
import re
import pandas as pd
df = pd.read_excel("tstfile.xlsx", names=["col1"])
for index, row in df.iterrows():
    # print(index)
    if str(row[0]).split():
        if not re.findall("(.[A-Z]\d+\-\d+)", str(row)):
            for i in re.findall("(\d+)", str(row)):
                print(i)
Input data would look like:
123, 456
111 * 222
LL123-456
35
I get an output that looks like this:
123
0
456
1
111
2
222
3
35
4
The final desired output should be:
123
456
111
222
35
So only the data that is actually given in as input.
You can change your code like this:
for row in df.values.astype(str):
    for word in row:
        if not re.findall(r"(.[A-Z]\d+\-\d+)", word):
            for num in re.findall(r"(\d+)", word):
                print(num)
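The stray digits most likely come from str(row): each row yielded by iterrows() is a Series, and its string form includes the index label (something like "Name: 0"), which the \d+ pattern then picks up. Searching only the cell text avoids that. Here is a small pandas-free sketch of the same filtering on the sample input; the list rows stands in for the spreadsheet column and is an assumption for illustration.

```python
import re

# Sample cells from the question, standing in for the spreadsheet column.
rows = ["123, 456", "111 * 222", "LL123-456", "35"]

numbers = []
for cell in rows:
    # Skip cells that look like a code such as "LL123-456".
    if re.search(r".[A-Z]\d+-\d+", cell):
        continue
    numbers.extend(re.findall(r"\d+", cell))

print(numbers)  # prints: ['123', '456', '111', '222', '35']
```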
Alternatively, here is a one liner that converts the dataframe values into a string and uses the re.findall method to extract the numbers as strings. Words that start with upper case letters and contain a minus sign are excluded.
all_numbers = re.findall(r'(\d+)', ' '.join([j for i in df.values.astype(str) for j in i if not re.search(r'[A-Z].+\-', j)]))
for item in all_numbers:
print(item)
If you need integers instead of strings, you can convert the list lazily with map, which returns an iterator in Python 3:
all_integers = map(int, all_numbers)
for i in all_integers:
    print(i)
But remember that such an iterator can only be consumed once.
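To illustrate the single-use behaviour in Python 3, here is a short sketch with a hard-coded all_numbers list (assumed here so the example is self-contained):

```python
all_numbers = ["123", "456", "111", "222", "35"]

all_integers = map(int, all_numbers)  # lazy iterator in Python 3
first_pass = list(all_integers)       # consumes the iterator
second_pass = list(all_integers)      # nothing left to yield

print(first_pass)   # prints: [123, 456, 111, 222, 35]
print(second_pass)  # prints: []
```

If the values are needed more than once, building a list up front (for example with a list comprehension) sidesteps the issue.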
You can try this:
>>> data = """123, 456
... 111 * 222
... LL123-456
... 35"""
>>> data = data.replace(',', '')
>>> data = data.split()
>>> x = [int(i) for i in data if i.isdigit()]
>>> print(x)
The output is
[123, 456, 111, 222, 35]
I have this string:
'fhsdkfhskdslshsdkhlghs
bksjvsfgsdnfsfbjfgzfga
avzaeafeaeaddacbytt!tw
fhsdkfhskdslshsdkhlghs
bksjvsfgsdnfsfbjfgzfga
avzaeafeaeaddacbytt!tw'
And I want to use this piece of code to cut it into pieces of length 22:
from textwrap import wrap
w_str= (wrap(str,22))
And then I get this:
fhsdkfhskdslshsdkhlghs
bksjvsfgsdnfsfbjfgzfga
avzaeafeaeaddacbytt!tw
The next step should take the last four letters of each string and paste them at the beginning of the next one, and so on.
Just like this, with an ID:
e_1
fhsdkfhskdslshsdkhlghs
bksjvsfgsdnfsfbjfgzfgaavza
e_2
avzaeafeaeaddacbytt!tw
fhsdkfhskdslshsdkhlghslghs
e_3
lghsbksjvsfgsdnfsfbjfgzfga
zfgaavzaeafeaeaddacbytt!tw
Once you have your string as such:
_str = """fhsdkfhskdslshsdkhlghs
bksjvsfgsdnfsfbjfgzfga
avzaeafeaeaddacbytt!tw"""
You can do the following:
>>> _str = _str.split()
>>> new = [_str[i-1][len(_str[i-1])-4:len(_str[i-1])]+_str[i] if i > 0 else _str[i] for i in range(len(_str))]
>>> print '\n'.join(new)
fhsdkfhskdslshsdkhlghs
lghsbksjvsfgsdnfsfbjfgzfga
zfgaavzaeafeaeaddacbytt!tw
>>>
Edit
zip two lists together in a list comprehension, as such:
'\n'.join(['\n'.join(item) for item in zip(['e_'+str(num) for num in range(1, len(new)+1)], new)])
>>> _str = _str.split()
>>> new = [_str[i-1][len(_str[i-1])-4:len(_str[i-1])]+_str[i] if i > 0 else _str[i] for i in range(len(_str))]
>>> print '\n'.join(['\n'.join(item) for item in zip(['e_'+str(num) for num in range(1, len(new)+1)], new)])
e_1
fhsdkfhskdslshsdkhlghs
e_2
lghsbksjvsfgsdnfsfbjfgzfga
e_3
zfgaavzaeafeaeaddacbytt!tw
>>>
In some ways, strings are like lists in Python: you can reference their contents by index, slice them, and so on.
So you could use the index of the characters in the string to pull out the last 4 characters of each wrapped string:
from textwrap import wrap
input_string = 'fhsdkfhskdslshsdkhlghsbksjvsfgsdnfsfbjfgzfgaavzaeafeaeaddacbytt!tw'
split_strings = wrap(input_string, 22)
add_string = '' # nothing there at first, but will be updated as we process each wrapped string
for output_string in split_strings:
    print(add_string + output_string)
    add_string = output_string[-4:] # "[-4:]" means: "from the fourth last char of the string, to the end"
outputs:
fhsdkfhskdslshsdkhlghs
lghsbksjvsfgsdnfsfbjfgzfga
zfgaavzaeafeaeaddacbytt!tw
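To also attach the e_1, e_2, ... ids from the question, the same idea can be wrapped in a small function. This is only a sketch: the name overlapping_chunks and its parameters are made up, and it uses plain slicing instead of textwrap.wrap, which is enough when the input contains no whitespace.

```python
def overlapping_chunks(text, width=22, overlap=4):
    """Cut text into width-sized pieces, each prefixed with the tail of the previous one."""
    pieces = [text[i:i + width] for i in range(0, len(text), width)]
    result = []
    carry = ""  # tail of the previous piece, empty for the first one
    for number, piece in enumerate(pieces, start=1):
        result.append(("e_%d" % number, carry + piece))
        carry = piece[-overlap:]
    return result


text = ("fhsdkfhskdslshsdkhlghs"
        "bksjvsfgsdnfsfbjfgzfga"
        "avzaeafeaeaddacbytt!tw")
for ident, chunk in overlapping_chunks(text):
    print(ident)
    print(chunk)
```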
How can I remove ids like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences?
I have this code:
with open('sequence.fasta', 'r') as f:
    while True:
        line1=f.readline()
        line2=f.readline()
        line3=f.readline()
        if not line3:
            break
        fct([line1[i:i+100] for i in range(0, len(line1), 100)])
        fct([line2[i:i+100] for i in range(0, len(line2), 100)])
        fct([line3[i:i+100] for i in range(0, len(line3), 100)])
Output:
['>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n']
['CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n']
['AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n']
['CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA\n']
['AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA\n']
['ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT\n']
['AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA\n']
['GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC\n']
['AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT\n']
['TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT\n']
['GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT\n']
['GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC\n']
['\n']
...
My function is:
def fct(input_string):
    code={"a":0,"c":1,"g":2,"t":3}
    p=[code[i] for i in input_string]
    n=len(input_string)
    c=0
    for i, n in enumerate(range(n, 0, -1)):
        c +=p[i]*(4**(n-1))
        return c+1
fct() returns an integer from a string. For example, ACT gives 8.
That is, my function must take as input string sequences that contain only the bases A, C, G and T.
But when I use my function it gives:
KeyError: '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n'
I tried to remove the ids by dropping the lines that start with > and writing the rest to a text file, so my text file output.txt contains just the sequences without ids. But when I use my function fct on it, I get the same kind of error:
KeyError: 'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n'
What can I do?
I see two major problems in your code: You're having problems parsing FASTA sequences, and your function is not properly iterating over each sequence.
Parsing FASTA data
Might I suggest using the excellent Biopython package? It has excellent FASTA support (reading and writing) built in (see Sequences in the Tutorial).
To parse sequences from a FASTA file:
from Bio import SeqIO
for record in SeqIO.parse("seqs.fasta", "fasta"):
    print(record.description)  # gi|2765658|emb|Z78533.1...
    print(record.seq)  # a Seq object, call str() to get a simple string
>>> record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> str(record.seq)
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACC' #(truncated)
Iterating over sequence data
In your code, you have a list of strings being passed to fct() (input_string is not actually a string, but a list of strings). The solution is just to build one input string, and iterate over that.
Other errors in fct:
You need to capitalize the keys to your dictionary: case matters
You should have the return statement after the for loop. Keeping it nested means c is returned immediately.
Why bother constructing p when you can just index into code when iterating over the sequence?
You write over the sequence's length (n) by using it in your for loop as a variable name
Modified code (with proper PEP 8 formatting), and variables renamed to be clearer what they mean (still have no idea what c is supposed to be):
from Bio import SeqIO
def dna_seq_score(dna_seq):
    nucleotide_code = {"A": 0, "C": 1, "G": 2, "T": 3}
    c = 0
    for i, k in enumerate(range(len(dna_seq), 0, -1)):
        nucleotide = dna_seq[i]
        code_num = nucleotide_code[nucleotide]
        c += code_num * (4 ** (k - 1))
    return c + 1
for record in SeqIO.parse("test.fasta", "fasta"):
    dna_seq_score(record.seq)
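If installing Biopython is not an option, the same idea can be sketched with the standard library only. The helper read_fasta below is a hypothetical minimal parser written for this example, not a replacement for SeqIO, and the scoring loop computes the same base-4 value as above by accumulating digit by digit. It assumes the sequences contain only A, C, G and T (any other character, such as N, raises a KeyError).

```python
NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def dna_seq_score(dna_seq):
    """Read the sequence as a base-4 number (A=0, C=1, G=2, T=3) and add 1."""
    c = 0
    for nucleotide in dna_seq.upper():
        c = c * 4 + NUCLEOTIDE_CODE[nucleotide]
    return c + 1


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


sample = [">seq1 example\n", "AC\n", "T\n", ">seq2 example\n", "GG\n"]
for header, seq in read_fasta(sample):
    print(header, dna_seq_score(seq))
# prints:
# seq1 example 8
# seq2 example 11
```

With a real file, the lines can come straight from the file object, for example read_fasta(open("test.fasta")).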