At the moment I have a string and I want to extract the contents of the parenthesis.
This is the string:
>>>string = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
This is the regular expression I am using and it yields the following:
>>>regex_output = re.findall(r'\((\d{3,4})m|([\d.:]+\d)\)',string)
>>>regex_output
[('600', ''), ('', '36.57')]
As I understand, the empty strings are caused by nesting capturing groups in my regex.
All I want is a list of two variables as:
['600','36.57']
I could make my new list from my current output but that would defeat the purpose of using the regular expression. So is there a way of achieving my desired output by modifying my regex. Thanks
>>> import re
>>> s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
You can search for the enclosing ( and )
>>> re.search('\((.*?)\)',s).group(1)
'600m 36.57'
Then split on the 'm ' characters
>>> re.search('\((.*?)\)',s).group(1).split('m ')
['600', '36.57']
You could try the below code also which uses positive look-behind to match the number which was just after to ( and also it uses lookahead to match the decimal number which was just before to ),
>>> import re
>>> s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
>>> m = re.findall(r'(?<=\()\d+|\d+[.:]\d+(?=\))', s, re.M)
>>> m
['600', '36.57']
Related
For example, I have a string:
The struct-of-application and struct-of-world
With re.sub, it will replace the matched with a predefined string. How can I replace the match with a transformation of the matched content? To get, for example:
The [application_of_struct](http://application_of_struct) and [world-of-struct](http://world-of-struct)
If I write a simple regex ((\w+-)+\w+) and try to use re.sub, it seems I can't use what I matched as part of the replacement, let alone edit the matched content:
In [10]: p.sub('struct','The struct-of-application and struct-of-world')
Out[10]: 'The struct and struct'
Use a function for the replacement
s = 'The struct-of-application and struct-of-world'
p = re.compile('((\w+-)+\w+)')
def replace(match):
return 'http://{}'.format(match.group())
#for python 3.6+ ...
#return f'http://{match.group()}'
>>> p.sub(replace, s)
'The http://struct-of-application and http://struct-of-world'
>>>
Try this:
>>> p = re.compile(r"((\w+-)+\w+)")
>>> p.sub('[\\1](http://\\1)','The struct-of-application and struct-of-world')
'The [struct-of-application](http://struct-of-application) and [struct-of-world](http://struct-of-world)'
I am trying to split a string such as: add(ten)sub(one) into add(ten) sub(one).
I can't figure out how to match the close parentheses. I have used re.sub(r'\\)', '\\) ') and every variation of escaping the parentheses,I can think of. It is hard to tell in this font but I am trying to add a space between these commands so I can split it into a list later.
There's no need to escape ) in the replacement string, ) has a special a special meaning only in the regex pattern so it needs to be escaped there in order to match it in the string, but in normal string it can be used as is.
>>> strs = "add(ten)sub(one)"
>>> re.sub(r'\)(?=\S)',r') ', strs)
'add(ten) sub(one)'
As #StevenRumbalski pointed out in comments the above operation can be simply done using str.replace and str.rstrip:
>>> strs.replace(')',') ').strip()
'add(ten) sub(one)'
d = ')'
my_str = 'add(ten)sub(one)'
result = [t+d for t in my_str.split(d) if len(t) > 0]
result = ['add(ten)','sub(one)']
Create a list of all substrings
import re
a = 'add(ten)sub(one)'
print [ b for b in re.findall('(.+?\(.+?\))', a) ]
Output:
['add(ten)', 'sub(one)']
How can I use regex in python to capture something between two strings or phrases, and removing everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I sift off "CG33289-PC" from the header below based on the stipulation that is occurs after the phrase "FlyBase_Annotation_IDs:" and before the next comma "," ?
I need to substitute the header with this simplified result "CG33289-PC" and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like - a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
Using regexps:
>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;', r'\1\n', s)
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>>
Not an elegant solution, but this should work for you:
>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])
assuming FlyBase_Annotation_IDs can only appear once in the data.
I'm not sure about the format of the file, but this regex will capture the data in your example:
"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"
Use findall function to get the match.
Assuming there is a newline after the header:
>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)
The group ([A-Z0-9a-z-]*) in the regular expression extracts any alphanumeric character and the dash. If ids can have other characters, just add them.
Like many other people posting questions here, I recently started programming in Python.
I'm faced with a problem trying to define the regular expression to extract a variable name (I have a list of variable names saved in the list) from a string.
I am parsing part of the code which I take line by line from a file.
I make a list of variables,:
>>> variable_list = ['var1', 'var2', 'var4_more', 'var3', 'var1_more']
What I want to do is to define re.compile with something that won't say that it found two var1; I want to make an exact match. According to the example above, var should match nothing, var1 should match only the first element of the list.
I presume that the answer may be combining regex with negation of other regex, but I am not sure how to solve this problem.
OK, I have noticed that I missed one important thing. Variable list is gathered from a string, so it's possible to have a space before the var name, or sign after.
More accurate variable_list would be something like
>>> variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
In this case it should recognize first 3, but not the last one as a var1.
It sounds like you just need to anchor your regex with ^ and $, unless I'm not understanding you properly:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = [item for item in mylist if r.match(item)]
>>> print matches
['var1']
So ^var1$ will match exactly var1, but not var1_text or var1var1. Is that what you're after?
I suppose one way to handle your edit would be with ^\W*var1\W*$ (where var1 is the variable name you want). The \W shorthand character class matches anything that is not in the \w class, and \w in Python is basically alphanumeric characters plus the underscore. The * means that this may be matched zero or more times. This results in:
variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
>>> r = re.compile(r'^\W*var1\W*$')
>>> matches = [item for item in variable_list if r.match(item)]
>>> print matches
[' var1;', 'var1 ;', 'var1)']
If you want the name of the variable without the extraneous stuff then you can capture it and extract the first capture group. Something like this, maybe (probably a bit inefficient since the regex runs twice on matched items):
>>> r = re.compile(r'^\W*(var1)\W*$')
>>> matches = [r.match(item).group(1) for item in variable_list if r.match(item)]
>>> print matches
['var1', 'var1', 'var1']
If you are trying to learn about regular expressions, then maybe this is a useful puzzle, but if you want to see whether a certain word is in a list of words why not this:
>>> 'var1' in mylist
True
>>> 'var1 ' in mylist
False
Not to expand too much more on the regex match, but you might consider using the 'filter()' builtin:
filter(function, iterable)
So, using one of the regex's suggested by #eldarerathis:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = filter(r.match, mylist)
['var1']
Or using your own match function:
>>> def matcher(value):
>>> ... match statement ...
>>> filter(matcher, mylist)
['var1']
Or negate the regex earlier with a lambda:
>>> filter(lambda x: not r.match(x), mylist)
['var2', 'var3_something', 'var1_text', 'var1var1']
Is there a way to determine how many capture groups there are in a given regular expression?
I would like to be able to do the follwing:
def groups(regexp, s):
""" Returns the first result of re.findall, or an empty default
>>> groups(r'(\d)(\d)(\d)', '123')
('1', '2', '3')
>>> groups(r'(\d)(\d)(\d)', 'abc')
('', '', '')
"""
import re
m = re.search(regexp, s)
if m:
return m.groups()
return ('',) * num_of_groups(regexp)
This allows me to do stuff like:
first, last, phone = groups(r'(\w+) (\w+) ([\d\-]+)', 'John Doe 555-3456')
However, I don't know how to implement num_of_groups. (Currently I just work around it.)
EDIT: Following the advice from rslite, I replaced re.findall with re.search.
sre_parse seems like the most robust and comprehensive solution, but requires tree traversal and appears to be a bit heavy.
MizardX's regular expression seems to cover all bases, so I'm going to go with that.
def num_groups(regex):
return re.compile(regex).groups
f_x = re.search(...)
len_groups = len(f_x.groups())
Something from inside sre_parse might help.
At first glance, maybe something along the lines of:
>>> import sre_parse
>>> sre_parse.parse('(\d)\d(\d)')
[('subpattern', (1, [('in', [('category', 'category_digit')])])),
('in', [('category', 'category_digit')]),
('subpattern', (2, [('in', [('category', 'category_digit')])]))]
I.e. count the items of type 'subpattern':
import sre_parse
def count_patterns(regex):
"""
>>> count_patterns('foo: \d')
0
>>> count_patterns('foo: (\d)')
1
>>> count_patterns('foo: (\d(\s))')
1
"""
parsed = sre_parse.parse(regex)
return len([token for token in parsed if token[0] == 'subpattern'])
Note that we're only counting root level patterns here, so the last example only returns 1. To change this, tokens would need to searched recursively.
First of all if you only need the first result of re.findall it's better to just use re.search that returns a match or None.
For the groups number you could count the number of open parenthesis '(' except those that are escaped by '\'. You could use another regex for that:
def num_of_groups(regexp):
rg = re.compile(r'(?<!\\)\(')
return len(rg.findall(regexp))
Note that this doesn't work if the regex contains non-capturing groups and also if '(' is escaped by using it as '[(]'. So this is not very reliable. But depending on the regexes that you use it might help.
Using your code as a basis:
def groups(regexp, s):
""" Returns the first result of re.findall, or an empty default
>>> groups(r'(\d)(\d)(\d)', '123')
('1', '2', '3')
>>> groups(r'(\d)(\d)(\d)', 'abc')
('', '', '')
"""
import re
m = re.search(regexp, s)
if m:
return m.groups()
return ('',) * len(m.groups())
Might be wrong, but I don't think there is a way to find the number of groups that would have been returned had the regex matched. The only way I can think of to make this work the way you want it to is to pass the number of matches your particular regex expects as an argument.
To clarify though: When findall succeeds, you only want the first match to be returned, but when it fails you want a list of empty strings? Because the comment seems to show all matches being returned as a list.