Search patterns with variable gaps in Python

I am looking for patterns in a list containing different strings as:
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
I would like to select the string that has the pattern 'T--T' (no matter how the string starts), so those elements would be selected and appended to a new list as:
namesSelected = ['TAATGH', 'ATGTTKKKK']
Using grep I could:
grep "T[[:alpha:]]\{2\}T"
Is there a similar feature in Python's re module?
Thanks for any help!

I think this is most likely what you want:
re.search(r'T[A-Z]{2}T', inputString)
The equivalent in Python for [[:alpha:]] would be [a-zA-Z]. You may replace [A-Z] with [a-zA-Z] in the code snippet above if you wish to allow lowercase alphabet.
Documentation for re.search.

Yep, you can use re.search:
>>> import re
>>> names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
>>> reslist = []
>>> for i in names:
...     res = re.search(r'T[A-Z]{2}T', i)
...     if res:
...         reslist.append(i)
...
>>> print(reslist)
['TAATGH', 'ATGTTKKKK']

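import re
def grep(l, pattern):
    r = re.compile(pattern)
    # Search each string in the list, not the pattern itself.
    return [s for s in l if r.search(s)]
namesSelected = grep(names, r"T\w{2}T")
Note the use of \w instead of [[:alpha:]]. Unlike [[:alpha:]], \w also matches digits and the underscore.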
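As a quick check of how the two character classes behave, here is a small sketch. The extra strings 'T12T' and 'T_9T' and the helper name select are made up for illustration; they are not part of the question.

```python
import re

# Sample list from the question plus two hypothetical strings
# that contain digits and an underscore between the T's.
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF', 'T12T', 'T_9T']

def select(strings, pattern):
    """Return the strings in which the pattern occurs anywhere."""
    rx = re.compile(pattern)
    return [s for s in strings if rx.search(s)]

# ASCII letters only between the two T's.
letters_only = select(names, r'T[A-Za-z]{2}T')

# \w accepts letters, digits and the underscore.
word_chars = select(names, r'T\w{2}T')

print(letters_only)
print(word_chars)
```

If the gap should only ever contain letters, the explicit [A-Za-z] class is probably the safer choice.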

Related

Capture substring within a string - dynamically

I have a string:
ostring = "Ref('r1_featuring', ObjectId('5f475')"
What I am trying to do is search the string and check if it starts with Ref, if it does it should remove everything in the string and keep the substring 5f475.
I know this can be done using a simple replace like so:
string = ostring.replace("Ref('r1_featuring', ObjectId('", '').replace("')", '')
But I cannot do it this way as it needs to all be dynamic as there are going to be different strings each time. So I need to do it in a way that it will search the string and check if it starts with Ref, if it does then grab the alphanumeric value.
Desired Output:
5f475
Any help will be appreciated.
Like that?
>>> import re
>>> pattern = r"Ref.*'(.*)'\)$"
>>> m = re.match(pattern, "Ref('r1_featuring', ObjectId('5f475')")
>>> if m:
...     print(m.group(1))
...
5f475
# >= python3.8
>>> if m := re.match(pattern, "Ref('r1_featuring', ObjectId('5f475')"):
...     print(m.group(1))
...
5f475
a regex-free solution :)
ostring = "Ref('r1_featuring', ObjectId('5f475')"
if ostring.startswith("Ref"):
    desired_part = ostring.rpartition("('")[-1].rpartition("')")[0]
str.rpartition
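Since the strings will differ each time, it may help to wrap the check in a function. This is only a sketch: the name extract_object_id and the assumption that the value is always wrapped as ObjectId('...') are mine, not from the question.

```python
import re

# Assumption: the wanted value always appears as ObjectId('<value>').
OBJECT_ID = re.compile(r"ObjectId\('([^']+)'\)")

def extract_object_id(text):
    """Return the ObjectId value if text starts with 'Ref', else None."""
    if not text.startswith("Ref"):
        return None
    m = OBJECT_ID.search(text)
    return m.group(1) if m else None

print(extract_object_id("Ref('r1_featuring', ObjectId('5f475')"))
print(extract_object_id("Other('r1_featuring', ObjectId('5f475')"))
```

Returning None for non-matching input lets the caller decide what to do with strings that do not start with Ref.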

How to delete copies of character from string using regex?

I want to delete the repeated copies of 'i' in this example. I tried using groups, but it's not working. What am I doing wrong?
import re
a = '123iiii'
b = re.match('.*i(i+)', a)
print(b.group(1))
>>> i
a = re.sub(b.group(1), '', a)
print(a)
>>> 123
Desired result is '123i'.
Thanks for the answer.
Maybe,
([^i]*i)i*([^\r\n]*)
and a replacement of,
\1\2
might be OK to look into.
Test
import re
string = '''
123iiii
123iiiiabc
123i
'''
expression = r'([^i]*i)i*([^\r\n]*)'
print(re.sub(expression, r'\1\2', string))
Output
123i
123iabc
123i
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
It seems that what you need is:
import re
a = '123iiii'
a = re.sub(r"i+", "i", a)
print(a)
>>> 123i
You can achieve your goal by simply using the sub function to replace a sequence of i's with a single i
import re
a = '123iiii'
a = re.sub(r'i+', 'i', a)
print(a)
The following will work even if you have the i's in multiple places in the string.
import re
s = '123iiii657iii'
re.sub('i+','i',s)
Output:
'123i657i'
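As for why the original attempt removes every 'i': the sketch below retraces it step by step. The variable name collapsed is just for illustration.

```python
import re

a = '123iiii'

# The greedy '.*' first takes the whole string, then backtracks just far
# enough to leave one 'i' for the literal and one 'i' for the group.
m = re.match(r'.*i(i+)', a)
print(m.group(1))

# The captured text 'i' is then used as a pattern, so re.sub removes
# every 'i' in the string, not only the extra ones.
print(re.sub(m.group(1), '', a))

# Replacing each run of 'i' with a single 'i' keeps exactly one.
collapsed = re.sub(r'i+', 'i', a)
print(collapsed)
```

So the group captured what was expected; the problem is that the second call treats the captured 'i' as a pattern and matches it everywhere.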

How to replace an re match with a transformation of that match?

For example, I have a string:
The struct-of-application and struct-of-world
With re.sub, it will replace the matched with a predefined string. How can I replace the match with a transformation of the matched content? To get, for example:
The [application_of_struct](http://application_of_struct) and [world-of-struct](http://world-of-struct)
If I write a simple regex ((\w+-)+\w+) and try to use re.sub, it seems I can't use what I matched as part of the replacement, let alone edit the matched content:
In [10]: p.sub('struct','The struct-of-application and struct-of-world')
Out[10]: 'The struct and struct'
Use a function for the replacement:
import re
s = 'The struct-of-application and struct-of-world'
p = re.compile(r'((\w+-)+\w+)')
def replace(match):
    return 'http://{}'.format(match.group())
    # for Python 3.6+:
    # return f'http://{match.group()}'
>>> p.sub(replace, s)
'The http://struct-of-application and http://struct-of-world'
Try this:
>>> p = re.compile(r"((\w+-)+\w+)")
>>> p.sub('[\\1](http://\\1)','The struct-of-application and struct-of-world')
'The [struct-of-application](http://struct-of-application) and [struct-of-world](http://struct-of-world)'
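If the match itself has to be edited before it is inserted, the replacement function can do that too. The sketch below assumes the intended transformation is to reverse the hyphen-separated parts and keep hyphens as the separator. The sample output in the question mixes underscores and hyphens, so that is a guess. The names HYPHENATED and to_link are made up for illustration.

```python
import re

# Words joined by at least one hyphen, e.g. struct-of-application.
HYPHENATED = re.compile(r'\w+(?:-\w+)+')

def to_link(match):
    """Reverse the hyphen-separated parts and wrap them in a Markdown link."""
    reversed_name = '-'.join(reversed(match.group().split('-')))
    return '[{0}](http://{0})'.format(reversed_name)

s = 'The struct-of-application and struct-of-world'
result = HYPHENATED.sub(to_link, s)
print(result)
```

Any other transformation (a different separator, upper-casing, a lookup table) would go inside to_link in the same way.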

Regex match key except one

I have a Python list. It contains strings like items[number].some field. I want to get all of these strings except those that match items[<number>].classification. How can I do this with a regex, or is there another way?
As an example, I have something like:
data.items.[0].deliveryAddress.region
data.items.[0].classification.scheme
data.items.[0].classification.id
data.items.[0].description
And I want to keep only:
data.items.[0].description
data.items.[0].deliveryAddress.region
To do this, I used this regex to match the strings you want to discard (the dots are escaped so that they match literal dots only):
data\.items\.\[\d+\]\.classification
Say I have a Python list containing those items called l:
l = ["data.items.[0].deliveryAddress.region",
     "data.items.[0].classification.scheme",
     "data.items.[0].classification.id",
     "data.items.[0].description"]
I can then use a list comprehension to only keep the values that don't match the regex, by using re.match.
>>> import re
>>> [x for x in l if not re.match(r"data\.items\.\[\d+\]\.classification", x)]
['data.items.[0].deliveryAddress.region', 'data.items.[0].description']
You could go for a negative lookahead combined with anchors:
^((?:.(?!classification))+)$
In Python code this would be:
import re
string = """
data.items.[0].deliveryAddress.region
data.items.[0].classification.scheme
data.items.[0].classification.id
data.items.[0].description
"""
rx = re.compile(r'^((?:.(?!classification))+)$', re.MULTILINE)
matches = rx.findall(string)
print(matches)
# ['data.items.[0].deliveryAddress.region', 'data.items.[0].description']
Obviously, this will work with a list as well:
import re
lst = ['data.items.[0].deliveryAddress.region',
       'data.items.[0].classification.scheme',
       'data.items.[0].classification.id',
       'data.items.[0].description']
# no need for re.MULTILINE here
rx = re.compile(r'^((?:.(?!classification))+)$')
matches = [x for x in lst if rx.match(x)]
print(matches)
# ['data.items.[0].deliveryAddress.region', 'data.items.[0].description']
See a demo on regex101.com.
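Both approaches above also discard keys that merely begin with the word, such as a field called classificationNote. If that matters, a stricter filter might look like the sketch below. The names CLASSIFICATION and drop_classification and the extra key are hypothetical, and it assumes classification is always the segment directly after items.[<number>].

```python
import re

# Assumption: a key is excluded only when 'classification' is the complete
# path segment directly after items.[<number>].
CLASSIFICATION = re.compile(r'\.items\.\[\d+\]\.classification(?:\.|$)')

def drop_classification(keys):
    """Return the keys that do not point into a classification object."""
    return [k for k in keys if not CLASSIFICATION.search(k)]

keys = ['data.items.[0].deliveryAddress.region',
        'data.items.[0].classification.scheme',
        'data.items.[0].classification.id',
        'data.items.[0].description',
        'data.items.[1].classificationNote']  # hypothetical extra key

kept = drop_classification(keys)
print(kept)
```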

python regular expression to find something in between two strings or phrases

How can I use regex in Python to capture something between two strings or phrases, and remove everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I extract "CG33289-PC" from the header below based on the stipulation that it occurs after the phrase "FlyBase_Annotation_IDs:" and before the next comma "," ?
I need to substitute the header with this simplified result "CG33289-PC" and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like - a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
Using regexps:
>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print(re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;\s*', r'\1\n', s))
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>>
Not an elegant solution, but this should work for you:
>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])
assuming FlyBase_Annotation_IDs can only appear once in the data.
I'm not sure about the format of the file, but this regex will capture the data in your example:
"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"
Use the findall function to get the match.
Assuming there is a newline after the header:
>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)
The group ([A-Z0-9a-z-]*) in the regular expression extracts any alphanumeric character and the dash. If ids can have other characters, just add them.
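For a whole file with many entries, a line-by-line pass may be easier to reason about than one large substitution. The following is a sketch under a few assumptions: every header starts with '>', the ID consists of letters, digits and dashes, and the header in the sample record is shortened from the one in the question. The names ANNOTATION_ID and simplify_headers are made up for illustration.

```python
import re

# Assumption: the ID consists of letters, digits and dashes only.
ANNOTATION_ID = re.compile(r'FlyBase_Annotation_IDs:([A-Za-z0-9-]+)')

def simplify_headers(lines):
    """Yield lines with each '>' header replaced by its annotation ID."""
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            m = ANNOTATION_ID.search(line)
            # Leave the header untouched if no annotation ID is found.
            yield m.group(1) if m else line
        else:
            # Sequence lines pass through unchanged.
            yield line

# Shortened version of the header from the question, for illustration.
record = [
    '>FBpp0293870 type=protein; name=CG33289-PC; '
    'dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; length=405;',
    'MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII',
    'FSRAV',
]

simplified = list(simplify_headers(record))
print('\n'.join(simplified))
```

Because the function accepts any iterable of lines, an open file object could be passed in place of the list.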
