I want to match different parts of a string and store them in separate variables for later use. For example,
string = "bunch(oranges, bananas, apples)"
rxp = "[a-z]*\([var1]\, [var2]\, [var3]\)"
so that I have
var1 = "oranges"
var2 = "bananas"
var3 = "apples"
Something like what re.search() does but for multiple different parts of the same match.
EDIT: the number of fruits in the list is not known beforehand. Should have put this in with the question.
That is what re.search does. Just use capturing groups (parentheses) to access the stuff that was matched by certain subpatterns later on:
>>> import re
>>> m = re.search(r"[a-z]*\(([a-z]*), ([a-z]*), ([a-z]*)\)", string)
>>> m.group(0)
'bunch(oranges, bananas, apples)'
>>> m.group(1)
'oranges'
>>> m.group(2)
'bananas'
>>> m.group(3)
'apples'
Also note that I used a raw string so the backslashes don't have to be doubled.
If the number of "variables" inside bunch can vary, you have a problem. Most regex engines, Python's re included, keep only the last match of a repeated capturing group, so they cannot capture a variable number of strings. However, in that case you could get away with this:
>>> m = re.search(r"[a-z]*\(([a-z, ]*)\)", string)
>>> m.group(1)
'oranges, bananas, apples'
>>> m.group(1).split(', ')
['oranges', 'bananas', 'apples']
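The capture-then-split idea above can be wrapped in a small helper that also tolerates uneven spacing around the commas. This is only a sketch: the name extract_items is made up for illustration, and it assumes the items themselves contain no commas or parentheses.

```python
import re

def extract_items(text):
    """Return the comma-separated items inside the first name(...) in text.

    extract_items is a hypothetical helper name, not part of the re module.
    """
    # Capture everything between the parentheses in a single group.
    m = re.search(r"[a-z]*\(([^()]*)\)", text)
    if m is None:
        return []
    inner = m.group(1).strip()
    if not inner:
        return []
    # Split on commas, ignoring any whitespace around them.
    return re.split(r"\s*,\s*", inner)

print(extract_items("bunch(oranges, bananas, apples)"))
```

Returning an empty list when nothing matches keeps the caller from having to check for None before iterating.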
For regular expressions, you can use the match() function to do what you want, and use groups to get your results. Also, avoid naming a variable string: that is the name of a standard library module, and the assignment shadows it if you ever import it. For your example, if you know there are always the same number of fruits each time, it looks like this:
import re
text = "bunch(oranges, bananas, apples)"
var1, var2, var3 = re.match(r'bunch\((\w+), (\w+), (\w+)\)', text).group(1, 2, 3)
Here, I used the \w special sequence, which matches any alphanumeric character or underscore, as explained in the documentation.
If you don't know the number of fruits in advance, you can use two regular expression calls: one to extract the minimal part of the string where the fruits are listed, getting rid of "bunch" and the parentheses, then finditer to extract the names of the fruits:
import re
text = "bunch(oranges, bananas, apples)"
[m.group(0) for m in re.finditer(r'\w+', re.match(r'bunch\(([^)]*)\)', text).group(1))]
If you want, you can use groupdict to store matching items in a dictionary:
import re
regex = re.compile(r"[a-z]*\((?P<var1>.*), (?P<var2>.*), (?P<var3>.*)\)")
match = regex.match("bunch(oranges, bananas, apples)")
if match:
    print(match.groupdict())
# {'var1': 'oranges', 'var2': 'bananas', 'var3': 'apples'}
Don't. Every time you use var1, var2 etc., you actually want a list. Unfortunately, there is no way to collect an arbitrary number of subgroups in a list using findall, but you can use a hack like this:
import re
lst = []
re.sub(r'([a-z]+)(?=[^()]*\))', lambda m: lst.append(m.group(1)), string)
print(lst)  # ['oranges', 'bananas', 'apples']
Note that this works not only for this specific example, but also for any number of substrings.
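For comparison, the same lookahead pattern can also be passed straight to re.findall, which returns the matched words as a list without the side-effecting lambda. This is a sketch for this kind of input only; text stands in for the question's string variable.

```python
import re

text = "bunch(oranges, bananas, apples)"

# Match a lowercase word only when a ")" follows it with no other
# parenthesis in between, i.e. the word sits inside the parentheses.
items = re.findall(r"[a-z]+(?=[^()]*\))", text)
print(items)
```

Because the pattern has no capturing group, findall returns the whole matches, so "bunch" (which is followed by "(" before any ")") is skipped.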
I have a string from which I want to take the values within the parentheses, and then get the individual values, which are separated by commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall (r'([^(,)]+)(?!.*\()', string)
print (l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a run of characters that are not (, a comma, or ) and, to prevent the leading x from being picked up, that may not be followed by a ( anywhere further in the string. This also makes it work if your preamble x consists of more than a single character.
findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
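A quick check of that claim, as a sketch: the input strings with the longer func preamble and with spaces after the commas are made-up variations, not taken from the question.

```python
import re

# Same expression as above, compiled once for reuse.
pattern = re.compile(r'([^(,)]+)(?!.*\()')

single = pattern.findall("x(142,1,23ERWA31)")
longer = pattern.findall("func(142,1,23ERWA31)")

# Spaces after the commas end up inside the matches, so strip them.
spaced = [item.strip() for item in pattern.findall("func(142, 1, 23ERWA31)")]

print(single)
print(longer)
print(spaced)
```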
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
A little wonky looking, and it also assumes there is always a group of 3 values in the parentheses, but try this regex:
\((.*?),(.*?),(.*?)\)
To extract all the group matches to a single object - your code would then look like
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult object like a list (it is a tuple):
>>> print(firstResult[2])
23ERWA31
I'm trying to write a filter in django that highlights words based on a search query. For example, if my string contains this is a sample string that I want to highlight using my filter and my search stubs are sam and ring, my desired output would be:
this is a <mark>sam</mark>ple st<mark>ring</mark> that I want to highlight using my filter
I'm using the answer from here to replace multiple words. I've presented the code below:
import re
words = search_stubs.split()
rep = dict((re.escape(k), '<mark>%s</mark>'%(k)) for k in words)
pattern = re.compile('|'.join(rep.keys()))
text = pattern.sub(lambda m : rep[re.escape(m.group(0))], text_to_replace)
However, when there is case sensitivity, this breaks. For example, if I have the string Check highlight function, and my search stub contains check, this breaks.
The desired output in this case would naturally be:
<mark>Check</mark> highlight function
You don't need a dictionary here. The (?i) case-insensitive modifier makes the whole pattern match case-insensitively.
>>> s = "this is a sample string that I want to highlight using my filter"
>>> l = ['sam', 'ring']
>>> re.sub('(?i)(' + '|'.join(map(re.escape, l)) + ')', r'<mark>\1</mark>', s)
'this is a <mark>sam</mark>ple st<mark>ring</mark> that I want to highlight using my filter'
Example 2:
>>> s = 'Check highlight function'
>>> l = ['check']
>>> re.sub('(?i)(' + '|'.join(map(re.escape, l)) + ')', r'<mark>\1</mark>', s)
'<mark>Check</mark> highlight function'
The simple way to do this is to not try to build a dict mapping every single word to its marked-up equivalent, and just use a capturing group and a reference to it. Then you can just use the IGNORECASE flag to do a case-insensitive search.
pattern = re.compile('({})'.format('|'.join(map(re.escape, words))),
                     re.IGNORECASE)
text = pattern.sub(r'<mark>\1</mark>', text_to_replace)
For example, if text_to_replace were:
I am Sam. Sam I am. I will not eat green eggs and spam.
… then text will be:
I am <mark>Sam</mark>. <mark>Sam</mark> I am. I will not eat green eggs and spam.
If you really did want to do it your way, you could. For example:
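# assumes every stub in words is lowercase, so the lowercased match is a key of rep
text = pattern.sub(lambda m: rep[re.escape(m.group(0).lower())].replace(m.group(0).lower(), m.group(0)),
                   text_to_replace)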
But that would be kind of silly. You're building a dict with 'sam' embedded in the value, just so you can replace that 'sam' with the 'Sam' that you actually matched.
See Grouping in the Regular Expression HOWTO for more on groups and references, and the re.sub docs for specifics on using references in substitutions.
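Putting the capturing-group approach into one function, as a minimal sketch: the name highlight and its two-argument signature are assumptions for illustration, and registering it as a Django template filter is left out.

```python
import re

def highlight(text, search_stubs):
    """Wrap each case-insensitive occurrence of a stub in <mark> tags.

    highlight is a hypothetical helper; search_stubs is a
    whitespace-separated string such as "sam ring".
    """
    words = search_stubs.split()
    if not words:
        # An empty alternation would match at every position,
        # so return the text unchanged.
        return text
    pattern = re.compile(
        "({})".format("|".join(map(re.escape, words))),
        re.IGNORECASE,
    )
    # \1 re-inserts the text exactly as it was matched, preserving its case.
    return pattern.sub(r"<mark>\1</mark>", text)

print(highlight("Check highlight function", "check"))
```

The backreference is what keeps "Check" capitalised in the output even though the stub was lowercase.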
At the moment I have a string and I want to extract the contents of the parentheses.
This is the string:
>>> string = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
This is the regular expression I am using and it yields the following:
>>> regex_output = re.findall(r'\((\d{3,4})m|([\d.:]+\d)\)',string)
>>> regex_output
[('600', ''), ('', '36.57')]
As I understand, the empty strings are caused by nesting capturing groups in my regex.
All I want is a list of two variables as:
['600','36.57']
I could make my new list from my current output, but that would defeat the purpose of using the regular expression. So is there a way of achieving my desired output by modifying my regex? Thanks.
>>> import re
>>> s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
You can search for the enclosing ( and )
>>> re.search(r'\((.*?)\)',s).group(1)
'600m 36.57'
Then split on the 'm ' characters
>>> re.search(r'\((.*?)\)',s).group(1).split('m ')
['600', '36.57']
You could also try the code below, which uses a positive lookbehind to match the number that comes right after ( and a lookahead to match the decimal number that comes right before ):
>>> import re
>>> s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
>>> m = re.findall(r'(?<=\()\d+|\d+[.:]\d+(?=\))', s, re.M)
>>> m
['600', '36.57']
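On where the empty strings come from: findall returns one tuple per match with one slot per capturing group, and with two groups on either side of | only one of them takes part in any given match. One possible alternative, sketched here on the assumption that both numbers always appear together inside the parentheses, is to put the two groups in sequence and read them from a single match.

```python
import re

s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"

# Both groups sit in one pattern, in sequence rather than in alternation,
# so a single match fills both of them.
m = re.search(r'\((\d{3,4})m\s+([\d.:]+\d)\)', s)
result = list(m.groups()) if m else []
print(result)
```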
I get some string like this: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}
I would like to capture all the paths: path1, path2, ... pathn. I tried the re module in python. However, it does not support multiple capture.
For example, r"\\mypath\{(\{[^\{\}\[\]]*\})*\}" will only return the last matched group: applying the pattern with search() to r"\mypath{{path1}{path2}}" gives groups() as ("{path2}",)
Then I found an alternative way to do this:
gpathRegexPat=r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
gpathRegexCp=re.compile(gpathRegexPat)
strpath=gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
>>> strpath
'{sadf}{ad}'
p=re.compile(r'\{([^\{\}\[\]]*)\}')
>>> p.findall(strpath)
['sadf', 'ad']
or:
>>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
>>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
>>> strpath=gpathRegexCp.search(r'\input{{whatever]{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
>>> strpath
'\\mypath{{sadf}{ad}}'
>>> p.findall(strpath)
['sadf', 'ad']
At this point, I thought, why not just use the findall on the original string? I may use:
gpathRegexPat=r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})". If the first (?:\{[^\{\}\[\]]*\})*? matches zero times and the second (?:\{[^\{\}\[\]]*\})*? matches once, it will capture sadf; if the first matches once and the second matches zero times, it will capture ad. However, findall only returns ['sadf'] with this regex.
Without all those extra patterns ((?:\\mypath\{) and (?:\})), it actually works:
>>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
>>> p2.findall(strpath)
['sadf', 'ad']
>>> p2.findall('{adadd}{dfada}{adafadf}')
['adadd', 'dfada', 'adafadf']
Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?
re.findall(r"{([^{}]+)}", text)
should work. On the \mypath part of the string it returns
['path1', 'path2', 'path3', 'pathn']
To isolate that part first:
my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
#get the \mypath part
my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
print(re.findall(r"{([^{}]+)}", my_path2))
or even better
re.findall(r"{(path\d+)}", text)  # will only return things like path<num> inside {}
You are right. It is not possible to return repeated subgroups inside a group. To do what you want, you can use a regular expression to capture the group and then use a second regular expression to capture the repeated subgroups.
In this case that would be something like \\mypath\{((?:\{.*?\})*)\}. For \mypath{{path1}{path2}{path3}}, its capture group holds {path1}{path2}{path3}
Then to find the repeating patterns of {pathn} inside that string, you can simply use \{(.*?)\}. This will match anything within the braces. The .*? is a non-greedy version of .*, meaning it will return the shortest possible match instead of the longest possible match.
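The two-step approach can be sketched as follows. The sample text is a shortened version of the question's string with three concrete paths (the elided "..." is left out), and the names outer and paths are just illustrative.

```python
import re

text = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}}\shape{{0.2}{0.3}}"

# Step 1: capture the whole run of {...} groups that follows \mypath{.
outer = re.search(r"\\mypath\{((?:\{[^{}]*\})*)\}", text)

# Step 2: pull each brace-delimited item out of that captured run.
paths = re.findall(r"\{([^{}]*)\}", outer.group(1)) if outer else []
print(paths)
```

The repetition lives inside a non-capturing group wrapped by one capturing group, so the single capture holds the full run instead of only its last repetition.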
Like many other people posting questions here, I recently started programming in Python.
I'm faced with a problem trying to define a regular expression to extract a variable name from a string (the variable names are saved in a list).
I am parsing part of the code which I take line by line from a file.
I make a list of variables:
>>> variable_list = ['var1', 'var2', 'var4_more', 'var3', 'var1_more']
What I want to do is to define re.compile with something that won't say that it found two var1; I want to make an exact match. According to the example above, var should match nothing, var1 should match only the first element of the list.
I presume that the answer may be combining regex with negation of other regex, but I am not sure how to solve this problem.
OK, I have noticed that I missed one important thing. The variable list is gathered from a string, so it's possible to have a space before the variable name, or a symbol after it.
More accurate variable_list would be something like
>>> variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
In this case it should recognize first 3, but not the last one as a var1.
It sounds like you just need to anchor your regex with ^ and $, unless I'm not understanding you properly:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = [item for item in mylist if r.match(item)]
>>> print(matches)
['var1']
So ^var1$ will match exactly var1, but not var1_text or var1var1. Is that what you're after?
I suppose one way to handle your edit would be with ^\W*var1\W*$ (where var1 is the variable name you want). The \W shorthand character class matches anything that is not in the \w class, and \w in Python is basically alphanumeric characters plus the underscore. The * means that this may be matched zero or more times. This results in:
>>> variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
>>> r = re.compile(r'^\W*var1\W*$')
>>> matches = [item for item in variable_list if r.match(item)]
>>> print(matches)
[' var1;', 'var1 ;', 'var1)']
If you want the name of the variable without the extraneous stuff then you can capture it and extract the first capture group. Something like this, maybe (probably a bit inefficient since the regex runs twice on matched items):
>>> r = re.compile(r'^\W*(var1)\W*$')
>>> matches = [r.match(item).group(1) for item in variable_list if r.match(item)]
>>> print(matches)
['var1', 'var1', 'var1']
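One way to avoid running the regex twice on each matched item is to keep the match object from a single call. A minimal sketch, where find_exact is a made-up helper name and re.escape is added so the variable name is treated literally:

```python
import re

def find_exact(name, candidates):
    """Return name once for every candidate that contains exactly that name,
    optionally surrounded by non-word characters.

    find_exact is a hypothetical helper, not part of the re module.
    """
    pattern = re.compile(r'^\W*(' + re.escape(name) + r')\W*$')
    found = []
    for item in candidates:
        m = pattern.match(item)  # match each item only once
        if m:
            found.append(m.group(1))
    return found

variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
print(find_exact('var1', variable_list))
```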
If you are trying to learn about regular expressions, then maybe this is a useful puzzle, but if you want to see whether a certain word is in a list of words why not this:
>>> 'var1' in mylist
True
>>> 'var1 ' in mylist
False
Not to expand too much more on the regex match, but you might consider using the 'filter()' builtin:
filter(function, iterable)
So, using one of the regexes suggested by @eldarerathis:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = list(filter(r.match, mylist))
>>> matches
['var1']
Or using your own match function:
>>> def matcher(value):
...     return r.match(value)  # stand-in for your own match logic
...
>>> list(filter(matcher, mylist))
['var1']
Or negate the regex earlier with a lambda:
>>> list(filter(lambda x: not r.match(x), mylist))
['var2', 'var3_something', 'var1_text', 'var1var1']