The following code runs without a syntax error, but it does not print all the results I expect. The program should scan a sequence string for certain substrings and, for each one found, print the substring together with the 19 characters that follow it. It should print a result for every occurrence of each substring.
Here is the code:
x=raw_input('GET STRING:: ');
m=len(x);
k=0;
while(k<m):
    if('AAT'in x or 'AAC' in x or 'AAG' in x):
        start = x.find('AAT') or x.find('AAC') or x.find('AAG')
        end=start+19
        print x[start:end]
When I input a string like ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT, it only searches for AAT and prints the resulting substring, but not AAG or AAC. Can anyone help me implement the or operator?
In your example, it's probably better to use a regular expression.
>>> import re
>>> text = 'ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT'
>>> re.search('(?:AA[TCG])(.{19})', text).group(1)
'CTTGTGATTGCATTGACAC'
You could switch to re.findall if you want multiple matches from the string. (But this won't work well if you want overlapping matches, i.e. if one of your 3-character strings appears again within the 19 characters that follow.)
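To illustrate that limitation, here is a minimal sketch in Python 3. The variable names are mine, and the pattern is written without a capturing group so that findall returns the whole 22-character hit. A plain findall consumes each hit before it searches on, so a triple that starts inside the previous hit is skipped:

```python
import re

text = 'ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT'

# Non-overlapping scan: each match consumes the triple plus the 19 characters after it
hits = re.findall(r'AA[TCG].{19}', text)
for hit in hits:
    print(hit)
```

Only the AAT and AAG hits are printed. The AAC that begins three characters after AAG lies inside the AAG hit, so this scan never reports it.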
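Search for the first occurrence starting from k:
mystring=raw_input('GET STRING:: ')
m=len(mystring)
k=0
while(k<m):
    x=mystring[k:]
    # find returns -1 for a pattern that is absent, so keep only real hits
    hits=[i for i in (x.find('AAT'),x.find('AAC'),x.find('AAG')) if i>=0]
    if not hits:
        break
    start=min(hits)
    end=start+19
    print x[start:end]
    k+=start+1
Because find returns -1 when a pattern is absent, start must be set to the minimum non-negative value of the three find results.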
You can handle overlapping matches with regular expressions that use lookahead assertions together with a capturing group:
>>> import re
>>> regex = re.compile("(?=(AA[TCG].{19}))")
>>> regex.findall("ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT")
['AATCTTGTGATTGCATTGACAC', 'AAGAACTCTTAGTGAAATATCA', 'AACTCTTAGTGAAATATCAGTA']
>>>
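If the position of each hit is also needed, the same lookahead pattern can be used with finditer. This is a small Python 3 sketch with names of my own choosing, not part of the answer above:

```python
import re

text = 'ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT'
pattern = re.compile(r'(?=(AA[TCG].{19}))')

# The lookahead itself is zero-width, so m.start() is the index where the triple begins;
# group(1) holds the triple plus the 19 characters after it.
hits = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
for pos, seq in hits:
    print(pos, seq)
```

The positions are 0-based. A triple that has fewer than 19 characters after it (such as the final AAT in this string) is not reported, because the pattern requires all 19.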
How about this:
import re
astr= "ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT"
alist = ['AAT','AAC','AAG']
newlist= [re.findall(e,astr) for e in alist]
Output: [['AAT','AAT'],['AAC'],['AAG']].
Here is a slightly heavier version with indexes:
import re
astr= "ATGGAATCTTGTGATTGCATTGACACGCCATGCCCTGGTGAAGAACTCTTAGTGAAATATCAGTATATCT"
def find_triple_base(astr, nth_sub):
    return [(m.end(), m.group(), astr[m.end(0):m.end(0)+nth_sub]) for m in re.finditer(r'AA[TCG]', astr)]
for e in find_triple_base(astr, 19): print(e)
Output:
(7, 'AAT', 'CTTGTGATTGCATTGACAC')
(43, 'AAG', 'AACTCTTAGTGAAATATCA')
(46, 'AAC', 'TCTTAGTGAAATATCAGTA')
(58, 'AAT', 'ATCAGTATATCT')
What it does: findall finds all occurrences of the base triples (alist) you are looking for and generates a new list of three lists of base triples, e.g. [['AAT','AAT'],['AAC'],['AAG']]. It's straightforward to print this out.
I hope this helps!
Have a look at this: http://ideone.com/U70n4y
Code:
x=raw_input('GET STRING:: ');
m=len(x);
k=0
if('AAT'in x ):
    start = x.find('AAT')
    end=start+19
    print x[start:end]
elif('AAC' in x ):
    start = x.find('AAC')
    end=start+19
    print x[start:end]
elif('AAG' in x):
    start = x.find('AAG')
    end=start+19
    print x[start:end]
Edit: try this regexp code:
import re
y=r"(?:AA[TCG]).{19}"
x=raw_input('GET STRING:: ');
l= re.findall(y,x)
for x in l:
    print x
    print len(x)
Related
I have the following string:
'100000|^104,500|^^0^0^0^0^0^0^0|^^^^^^^^^412824|103000|103000|103000|103000^^'
How can I sum the last 5 integers, which come after |^^^^^^^^^ and before ^^ and are separated by |?
I tried re.split('[|^^^^^^^^^]', string), but it splits on the | and ^ characters and returns a list.
import re
string = '100000|^104,500|^^0^0^0^0^0^0^0|^^^^^^^^^412824|103000|103000|103000|103000^^'
answer = sum(map(int, re.search(r'\^{9}(.+)\^\^', string).group(1).split('|')))
answer:
824824
Using re.search with lookbehind and lookahead:
Demo:
import re
s = '100000|^104,500|^^0^0^0^0^0^0^0|^^^^^^^^^412824|103000|103000|103000|103000^^'
d = re.search(r"(?<=\^{9}).*?(?=\^{2})", s)
if d:
    print( sum(map(int, d.group().split("|"))) )
Output:
824824
Those characters are special in regular expressions and need to be escaped. Try with this:
import re
s = '100000|^104,500|^^0^0^0^0^0^0^0|^^^^^^^^^412824|103000|103000|103000|103000^^'
nums = re.split(r'\|\^{9}', s)[1]
# Find all integers and sum
total = sum(map(int, re.findall(r'\d+', nums)))
print(total)
Output:
824824
You can try this (without the re library):
a='100000|^104,500|^^0^0^0^0^0^0^0|^^^^^^^^^412824|103000|103000|103000|103000^^'
a=a.split('^'*9)
a=(a[1]).replace('^^','')
a=a.split('|')
s = 0
for i in a:
    s += int(i)
print(s)
A fully-regex solution could use this regex:
.+\|\^{9}|[\^\|]+
You can split using this regex. The resulting array will contain some empty elements, however, you can easily check for them while adding.
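As a rough sketch of how that split could be used (Python 3, variable names are mine), the empty elements can be filtered out while summing. The run of nine carets is written as '^' * 9 here only to make the count explicit:

```python
import re

# Same data as in the question, with the nine-caret run spelled out explicitly
s = ('100000|^104,500|^^0^0^0^0^0^0^0|' + '^' * 9 +
     '412824|103000|103000|103000|103000^^')

# First alternative swallows everything up to and including the nine carets;
# second alternative splits on the remaining | and ^ separators.
parts = re.split(r'.+\|\^{9}|[\^\|]+', s)

# The split leaves empty strings at both ends, which "if p" skips
total = sum(int(p) for p in parts if p)
print(total)
```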
with findall() and negative lookahead:
sum( int(i) for i in re.findall(r"(?!.*\^{9})\d+",s))
I have a string of characters with no specific pattern. I have to look for some specific words and then extract some information.
Currently I am stuck at finding the position of the last number in a string.
So, for example if:
mystring="The total income from company xy was 12320 for the last year and 11932 in the previous year"
I want to find out the position of the last number in this string.
So the result should be "2" in position "70".
You can do this with a regular expression, here's a quick attempt:
>>> import re
>>> mo = re.match('.+([0-9])[^0-9]*$', mystring)
>>> print mo.group(1), mo.start(1)
2 69
This is a 0-based position, of course.
You can use a generator expression inside next to walk the enumerated string from the end:
>>> next(i for i,j in list(enumerate(mystring,1))[::-1] if j.isdigit())
70
Or using regex :
>>> import re
>>>
>>> m=re.search(r'(\d)[^\d]*$',mystring)
>>> m.start()+1
70
Save all the numbers from the string in a list and pop the last one off it.
array = [int(s) for s in mystring.split() if s.isdigit()]
lastdigit = array.pop()
Note that this gives the last whole number (11932 here), not its position in the string. It may be faster than a regex approach and is arguably more readable.
def find_last(s):
    temp = list(enumerate(s))
    temp.reverse()
    for pos, chr in temp:
        try:
            return(pos, int(chr))
        except ValueError:
            continue
You could reverse the string and get the first match with a simple regex:
import re
s = mystring[::-1]
m = re.search(r'\d', s)
pos = len(s) - m.start(0)  # 1-based position in the original string
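For comparison, here is a minimal Python 3 sketch that avoids both the regex and the reversed copy by walking the indexes backwards. The function name is hypothetical, and the position it returns is 0-based:

```python
def last_digit_position(s):
    """Return (index, digit) for the last digit in s (0-based), or None if there is none."""
    for i in range(len(s) - 1, -1, -1):
        if s[i].isdigit():
            return i, s[i]
    return None

mystring = ("The total income from company xy was 12320 for the last year "
            "and 11932 in the previous year")
result = last_digit_position(mystring)
print(result)
```

Add 1 to the index if you want the 1-based position used in the question.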
I can't get a regular expression to double a repeated character in Python when it occurs an odd number of times in a row.
Example:
char = ``...```.....``...`....`````...`
to
``...``````.....``...``....``````````...``
Runs with an even number of occurrences should not be replaced.
For example:
>>> import re
>>> s = "`...```.....``...`....`````...`"
>>> re.sub(r'((?<!`)(``)*`(?!`))', r'\1\1', s)
'``...``````.....``...``....``````````...``'
Maybe I'm old fashioned (or my regex skills aren't up to par), but this seems to be a lot easier to read:
import re
def double_odd(regex,string):
    """
    Look for groups that match the regex. Double every odd-numbered match (1st, 3rd, ...).
    """
    count = [0]
    def _double(match):
        count[0] += 1
        return match.group(0) if count[0]%2 == 0 else match.group(0)*2
    return re.sub(regex,_double,string)
s = "`...```.....``...`....`````...`"
print double_odd('`',s)
print double_odd('`+',s)
It seems that I might have been a little confused about what you were actually looking for. Based on the comments, this becomes even easier:
def odd_repl(match):
    """
    double a match (all of the matched text) when the length of the
    matched text is odd
    """
    g = match.group(0)
    return g*2 if len(g)%2 == 1 else g
re.sub(regex,odd_repl,your_string)
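The call above leaves regex and your_string unspecified. As a runnable Python 3 sketch, assuming the intended regex is one that matches each maximal run of backticks, it could look like this. The sample string is built from a '#' placeholder purely so that no long backtick runs appear in the listing:

```python
import re

def odd_repl(match):
    """Double the matched run when its length is odd; leave even-length runs alone."""
    g = match.group(0)
    return g * 2 if len(g) % 2 == 1 else g

# The question's sample string, written with '#' standing in for the backtick
s = "#...###.....##...#....#####...#".replace('#', '`')

# '`+' matches each maximal run of backticks (my assumption for the unspecified regex)
result = re.sub(r'`+', odd_repl, s)
print(result)
```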
This may be not as good as the regex solution, but works:
In [101]: s1=re.findall(r'`{1,}',char)
In [102]: s2=re.findall(r'\.{1,}',char)
In [103]: fill=s1[-1] if len(s1[-1])%2==0 else s1[-1]*2
In [104]: "".join("".join((x if len(x)%2==0 else x*2,y)) for x,y in zip(s1,s2))+fill
Out[104]: '``...``````.....``...``....``````````...``'
Assume I have a string as follows: expression = '123 + 321'.
I am walking over the string character by character as follows: for p in expression. I am checking if p is a digit using p.isdigit(). If p is a digit, I'd like to grab the whole number (so grab 123 and 321, not just p, which initially would be 1).
How can I do that in Python?
In C (coming from a C background), the equivalent would be:
int x = 0;
sscanf(p, "%d", &x);
// the full number is now in x
EDIT:
Basically, I am accepting a mathematical expression from a user that accepts positive integers, +,-,*,/ as well as brackets: '(' and ')'. I am walking the string character by character and I need to be able to determine whether the character is a digit or not. Using isdigit(), I can do that. If it is a digit, however, I need to grab the whole number. How can that be done?
>>> from itertools import groupby
>>> expression = '123 + 321'
>>> expression = ''.join(expression.split()) # strip whitespace
>>> for k, g in groupby(expression, str.isdigit):
        if k: # it's a digit
            print 'digit'
            print list(g)
        else:
            print 'non-digit'
            print list(g)
digit
['1', '2', '3']
non-digit
['+']
digit
['3', '2', '1']
This is one of those problems that can be approached from many different directions. Here's what I think is an elegant solution based on itertools.takewhile:
>>> from itertools import chain, takewhile
>>> def get_numbers(s):
...     s = iter(s)
...     for c in s:
...         if c.isdigit():
...             yield ''.join(chain(c, takewhile(str.isdigit, s)))
...
>>> list(get_numbers('123 + 456'))
['123', '456']
This even works inside a list comprehension:
>>> def get_numbers(s):
...     s = iter(s)
...     return [''.join(chain(c, takewhile(str.isdigit, s)))
...             for c in s if c.isdigit()]
...
>>> get_numbers('123 + 456')
['123', '456']
Looking over other answers, I see that this is not dissimilar to jamylak's groupby solution. I would recommend that if you don't want to discard the extra symbols. But if you do want to discard them, I think this is a bit simpler.
The Python documentation includes a section on simulating scanf, which gives you some idea of how you can use regular expressions to simulate the behavior of scanf (or sscanf, it's all the same in Python). In particular, r'\-?\d+' is the Python string that corresponds to the regular expression for an integer. (r'\d+' for a nonnegative integer.) So you could embed this in your loop as
integer = re.compile(r'\-?\d+')
for p in expression:
    if p.isdigit():
        # somehow find the current position in the string
        integer.match(expression, curpos)
But that still reflects a very C-like way of thinking. In Python, your iterator variable p is really just an individual character that has actually been pulled out of the original string and is standing on its own. So in the loop, you don't naturally have access to the current position within the string, and trying to calculate it is going to be less than optimal.
What I'd suggest instead is using Python's built in regexp matching iteration method:
integer = re.compile(r'\-?\d+') # only do this once in your program
all_the_numbers = integer.findall(expression)
and now all_the_numbers is a list of string representations of all the integers in the expression. If you wanted to actually convert them to integers, then you could do this instead of the last line:
all_the_numbers = [int(m.group()) for m in integer.finditer(expression)]
Here I've used finditer instead of findall because it yields match objects one at a time, so you don't have to make a list of all the strings before iterating over them again to convert them to integers.
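Since the question also involves operators and brackets, the same finditer idea can be stretched into a small tokenizer. This is only a sketch in Python 3: the TOKEN pattern and the tokenize name are my own, and it assumes non-negative integers and single-character operators as described in the question:

```python
import re

# Hypothetical token pattern: a run of digits, or one operator/bracket character
TOKEN = re.compile(r'\d+|[-+*/()]')

def tokenize(expression):
    """Return the expression as a list of ints and one-character operator strings."""
    tokens = []
    for m in TOKEN.finditer(expression):
        text = m.group()
        tokens.append(int(text) if text.isdigit() else text)
    return tokens

print(tokenize('123 + 321'))
print(tokenize('(92831 * 948) / 32'))
```

Anything the pattern does not match, such as whitespace or an unexpected letter, is silently skipped, so real input validation would need an extra check.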
Though I'm not familiar with sscanf (I'm no C developer), it looks like it's using format strings in a way not dissimilar to what I'd use Python's re module for. Something like this:
import re
nums = re.compile(r'\d+')
found = nums.findall('123 + 321')
# if you know you're only looking for two values.
left, right = found
You can use shlex http://docs.python.org/library/shlex.html
>>> from shlex import shlex
>>> expression = '123 + 321'
>>> for e in shlex(expression):
...     print e
...
123
+
321
>>> expression = '(92831 * 948) / 32'
>>> for e in shlex(expression):
...     print e
...
(
92831
*
948
)
/
32
I'd split the string up on the ' + ' string, giving you what's outside of them:
>>> expression = '123 + 321'
>>> ex = expression.split(' + ')
>>> ex
['123', '321']
>>> int_ex = map(int, ex)
>>> int_ex
[123, 321]
>>> sum(int_ex)
444
It's dangerous, but you could use eval:
>>> eval('123 + 321')
444
I'm just taking a stab at you parsing the string, and doing raw calculations on it.
e_array = expression.split('+')
i_array = map(int, e_array)
And i_array holds all integers in the expression.
UPDATE
If you already know all the special characters in your expression and you want to eliminate them all
import re
e_array = re.split(r'[*/+\-() ]', expression) # the characters here are mult, div, plus, minus, left and right parenthesis, and space
i_array = map(int, filter(lambda x: len(x), e_array))
I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
I tried doing this:
p = re.compile("\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives output: AX&EUr)
Is there any way to correct this, rather than iterating each element in the string?
Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile(r"\([^()]*\)")
count = 1
while count:
    s, count = p.subn("", s)
Working example: http://ideone.com/WicDK
You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile("""\([^\)]*\)""").sub('', s)
'AX&E'
Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile("\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'
Nested brackets (or tags, ...) cannot be handled in a general way using regular expressions. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.
It's possible to construct a regex that can handle two levels of nesting, but it is already ugly, and three levels will be quite long. And you don't want to think about four levels. ;-)
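For this particular task a full parser may be more than is needed. A depth counter over the characters is enough to drop everything inside brackets at any nesting level. This is a minimal Python 3 sketch with a function name of my own, and it does iterate over the string, which the question hoped to avoid:

```python
def strip_parenthesized(s):
    """Remove every parenthesized group from s, including nested ones."""
    depth = 0
    kept = []
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            # A stray ')' with no matching '(' is dropped without lowering depth below 0
            if depth > 0:
                depth -= 1
        elif depth == 0:
            # Only characters outside all brackets are kept
            kept.append(ch)
    return ''.join(kept)

print(strip_parenthesized("AX(p>q)&E((-p)Ur)"))
```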
You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?
You could use re.subn():
import re
s = 'AX(p>q)&E((-p)Ur)'
while True:
    s, n = re.subn(r'\([^)(]*\)', '', s)
    if n == 0:
        break
print(s)
Output
AX&E
this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # character with the highest code point in the string
print(min(user)) # character with the lowest code point in the string
print(user.join(["1","2","3","4"])) # join needs strings, so the items are quoted
input()