Regular expression in python - help needed - python

Like many other people posting questions here, I recently started programming in Python.
I'm faced with a problem trying to define the regular expression to extract a variable name (I have a list of variable names saved in the list) from a string.
I am parsing part of the code which I take line by line from a file.
I make a list of variables,:
>>> variable_list = ['var1', 'var2', 'var4_more', 'var3', 'var1_more']
What I want to do is to define re.compile with something that won't say that it found two var1; I want to make an exact match. According to the example above, var should match nothing, var1 should match only the first element of the list.
I presume that the answer may be combining regex with negation of other regex, but I am not sure how to solve this problem.
OK, I have noticed that I missed one important thing. Variable list is gathered from a string, so it's possible to have a space before the var name, or sign after.
More accurate variable_list would be something like
>>> variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
In this case it should recognize first 3, but not the last one as a var1.

It sounds like you just need to anchor your regex with ^ and $, unless I'm not understanding you properly:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = [item for item in mylist if r.match(item)]
>>> print matches
['var1']
So ^var1$ will match exactly var1, but not var1_text or var1var1. Is that what you're after?
I suppose one way to handle your edit would be with ^\W*var1\W*$ (where var1 is the variable name you want). The \W shorthand character class matches anything that is not in the \w class, and \w in Python is basically alphanumeric characters plus the underscore. The * means that this may be matched zero or more times. This results in:
variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
>>> r = re.compile(r'^\W*var1\W*$')
>>> matches = [item for item in variable_list if r.match(item)]
>>> print matches
[' var1;', 'var1 ;', 'var1)']
If you want the name of the variable without the extraneous stuff then you can capture it and extract the first capture group. Something like this, maybe (probably a bit inefficient since the regex runs twice on matched items):
>>> r = re.compile(r'^\W*(var1)\W*$')
>>> matches = [r.match(item).group(1) for item in variable_list if r.match(item)]
>>> print matches
['var1', 'var1', 'var1']

If you are trying to learn about regular expressions, then maybe this is a useful puzzle, but if you want to see whether a certain word is in a list of words why not this:
>>> 'var1' in mylist
True
>>> 'var1 ' in mylist
False

Not to expand too much more on the regex match, but you might consider using the 'filter()' builtin:
filter(function, iterable)
So, using one of the regex's suggested by #eldarerathis:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = filter(r.match, mylist)
['var1']
Or using your own match function:
>>> def matcher(value):
>>> ... match statement ...
>>> filter(matcher, mylist)
['var1']
Or negate the regex earlier with a lambda:
>>> filter(lambda x: not r.match(x), mylist)
['var2', 'var3_something', 'var1_text', 'var1var1']

Related

Regex to match digits joined by a character and their units [duplicate]

I want to match different parts of a string and store them in separate variables for later use. For example,
string = "bunch(oranges, bananas, apples)"
rxp = "[a-z]*\([var1]\, [var2]\, [var3]\)"
so that I have
var1 = "oranges"
var2 = "bananas"
var3 = "apples"
Something like what re.search() does but for multiple different parts of the same match.
EDIT: the number of fruits in the list is not known beforehand. Should have put this in with the question.
That is what re.search does. Just use capturing groups (parentheses) to access the stuff that was matched by certain subpatterns later on:
>>> import re
>>> m = re.search(r"[a-z]*\(([a-z]*), ([a-z]*), ([a-z]*)\)", string)
>>> m.group(0)
'bunch(oranges, bananas, apples)'
>>> m.group(1)
'oranges'
>>> m.group(2)
'bananas'
>>> m.group(3)
'apples'
Also note, that I used a raw string to avoid the double backslashes.
If your number of "variables" inside bunch can vary, you have a problem. Most regex engines cannot capture a variable number of strings. However in that case you could get away with this:
>>> m = re.search(r"[a-z]*\(([a-z, ]*)\)", string)
>>> m.group(1)
'oranges, bananas, apples'
>>> m.group(1).split(', ')
['oranges', 'bananas', 'apples']
For regular expressions, you can use the match() function to do what you want, and use groups to get your results. Also, don't assign to the word string, as that is a built-in function (even though it's deprecated). For your example, if you know there are always the same number of fruits each time, it looks like this:
import re
input = "bunch(oranges, bananas, apples)"
var1, var2, var3 = re.match('bunch\((\w+), (\w+), (\w+)\)', input).group(1, 2, 3)
Here, I used the \w special sequence, which matches any alphanumeric character or underscore, as explained in the documentation
If you don't know the number of fruits in advance, you can use two regular expression calls, one to get extract the minimal part of the string where the fruits are listed, getting rid of "bunch" and the parentheses, then finditer to extract the names of the fruits:
import re
input = "bunch(oranges, bananas, apples)"
[m.group(0) for m in re.finditer('\w+(, )?', re.match('bunch\(([^)]*)\)', input).group(1))]
If you want, you can use groupdict to store matching items in a dictionary:
regex = re.compile("[a-z]*\((?P<var1>.*)\, (?P<var2>.*)\, (?P<var3>.*)")
match = regex.match("bunch(oranges, bananas, apples)")
if match:
match.groupdict()
#{'var1': 'oranges', 'var2': 'bananas', 'var3': 'apples)'}
Don't. Every time you use var1, var2 etc, you actually want a list. Unfortunately, this is no way to collect arbitrary number of subgroups in a list using findall, but you can use a hack like this:
import re
lst = []
re.sub(r'([a-z]+)(?=[^()]*\))', lambda m: lst.append(m.group(1)), string)
print lst # ['oranges', 'bananas', 'apples']
Note that this works not only for this specific example, but also for any number of substrings.

Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I am not able to understand how shall I write the regular expression. The values can be fetched in the form of a list also. I write few more patterns but it isn't helping.
In regex, you may use re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'
z="Ethernet0/0.222."
print z[:2]+"".join(re.findall(r"(\d+)(?=[\d\W]*$)",z))
You can try this.This will make sure only digits from end come into play .
Here is another option:
s = 'Ethernet0/0.222'
"".join(re.findall('^\w{2}|[\d]+', s))

How to capture multiple repeating patterns with regular expression?

I get some string like this: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}
I would like to capture all the paths: path1, path2, ... pathn. I tried the re module in python. However, it does not support multiple capture.
For example: r"\\mypath\{(\{[^\{\}\[\]]*\})*\}" will only return the last matched group. Applying the pattern to search(r"\mypath{{path1}{path2}})" will only return groups() as ("{path2}",)
Then I found an alternative way to do this:
gpathRegexPat=r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
gpathRegexCp=re.compile(gpathRegexPat)
strpath=gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
>>> strpath
'{sadf}{ad}'
p=re.compile('\{([^\{\}\[\]]*)\}')
>>> p.findall(strpath)
['sadf', 'ad']
or:
>>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
>>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
>>> strpath=gpathRegexCp.search(r'\input{{whatever]{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
>>> strpath
'\\mypath{{sadf}{ad}}'
>>> p.findall(strpath)
['sadf', 'ad']
At this point, I thought, why not just use the findall on the original string? I may use:
gpathRegexPat=r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})": if the first (?:\{[^\{\}\[\]]*\})*? matches 0 time and the 2nd (?:\{[^\{\}\[\]]*\})*? matches 1 time, it will capture sadf; if the first (?:\{[^\{\}\[\]]*\})*? matches 1 time, the 2nd one matches 0 time, it will capture ad. However, it will only return ['sadf'] with this regex.
With out all those extra patterns ((?:\\mypath\{) and (?:\})), it actually works:
>>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
>>> p2.findall(strpath)
['sadf', 'ad']
>>> p2.findall('{adadd}{dfada}{adafadf}')
['adadd', 'dfada', 'adafadf']
Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?
re.findall("{([^{}]+)}",text)
should work
returns
['path1', 'path2', 'path3', 'pathn']
finally
my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
#get the \mypath part
my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
print re.findall("{([^{}]+)}",my_path2)
or even better
re.findall("{(path\d+)}",text) #will only return things like path<num> inside {}
You are right. It is not possible to return repeated subgroups inside a group. To do what you want, you can use a regular expression to capture the group and then use a second regular expression to capture the repeated subgroups.
In this case that would be something like: \\mypath{(?:\{.*?\})}. This will return {path1}{path2}{path3}
Then to find the repeating patterns of {pathn} inside that string, you can simply use \{(.*?)\}. This will match anything withing the braces. The .*? is a non-greedy version of .*, meaning it will return the shortest possible match instead of the longest possible match.

python regular expression to find something in between two strings or phrases

How can I use regex in python to capture something between two strings or phrases, and removing everything else on the line?
For example, the following is a protein sequence preceded by a one-line header. How can I sift off "CG33289-PC" from the header below based on the stipulation that is occurs after the phrase "FlyBase_Annotation_IDs:" and before the next comma "," ?
I need to substitute the header with this simplified result "CG33289-PC" and not destroy the protein sequence (found below the header-line in all caps).
This is what each protein sequence entry looks like - a header followed by a sequence:
>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
This is the desired output:
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
Using regexps:
>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;', r'\1\n', s)
CG33289-PC
MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>>
Not an elegant solution, but this should work for you:
>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])
assuming FlyBase_Annotation_IDs can only appear once in the data.
I'm not sure about the format of the file, but this regex will capture the data in your example:
"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"
Use findall function to get the match.
Assuming there is a newline after the header:
>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)
The group ([A-Z0-9a-z-]*) in the regular expression extracts any alphanumeric character and the dash. If ids can have other characters, just add them.

How to split a string by using [] in Python

So from this string:
"name[id]"
I need this:
"id"
I used str.split ('[]'), but it didn't work. Does it only take a single delimiter?
Use a regular expression:
import re
s = "name[id]"
re.find(r"\[(.*?)\]", s).group(1) # = 'id'
str.split() takes a string on which to split input. For instance:
"i,split,on commas".split(',') # = ['i', 'split', 'on commas']
The re module also allows you to split by regular expression, which can be very useful, and I think is what you meant to do.
import re
s = "name[id]"
# split by either a '[' or a ']'
re.split('\[|\]', s) # = ['name', 'id', '']
Either
"name[id]".split('[')[1][:-1] == "id"
or
"name[id]".split('[')[1].split(']')[0] == "id"
or
re.search(r'\[(.*?)\]',"name[id]").group(1) == "id"
or
re.split(r'[\[\]]',"name[id]")[1] == "id"
Yes, the delimiter is the whole string argument passed to split. So your example would only split a string like 'name[]id[]'.
Try eg. something like:
'name[id]'.split('[', 1)[-1].split(']', 1)[0]
'name[id]'.split('[', 1)[-1].rstrip(']')
I'm not a fan of regex, but in cases like it often provides the best solution.
Triptych already recommended this, but I'd like to point out that the ?P<> group assignment can be used to assign a match to a dictionary key:
>>> m = re.match(r'.*\[(?P<id>\w+)\]', 'name[id]')
>>> result_dict = m.groupdict()
>>> result_dict
{'id': 'id'}
>>>
You don't actually need regular expressions for this. The .index() function and string slicing will work fine.
Say we have:
>>> s = 'name[id]'
Then:
>>> s[s.index('[')+1:s.index(']')]
'id'
To me, this is easy to read: "start one character after the [ and finish before the ]".
def between_brackets(text):
return text.partition('[')[2].partition(']')[0]
This will also work even if your string does not contain a […] construct, and it assumes an implied ] at the end in the case you have only a [ somewhere in the string.
I'm new to python and this is an old question, but maybe this?
str.split('[')[1].strip(']')
You can get the value of the list use []. For example, create a list from URL like below with split.
>>> urls = 'http://quotes.toscrape.com/page/1/'
This generates a list like the one below.
>>> print( urls.split("/") )
['http:', '', 'quotes.toscrape.com', 'page', '11', '']
And what if you wanna get value only "http" from this list? You can use like this
>>> print(urls.split("/")[0])
http:
Or what if you wanna get value only "1" from this list? You can use like this
>>> print(urls.split("/")[-2])
1
str.split uses the entire parameter to split a string. Try:
str.split("[")[1].split("]")[0]

Categories

Resources