How to capture multiple repeating patterns with regular expression? - python

I get some string like this: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}
I would like to capture all the paths: path1, path2, ... pathn. I tried the re module in Python. However, it does not keep every repetition of a capturing group.
For example, r"\\mypath\{(\{[^\{\}\[\]]*\})*\}" only returns the last matched group: applying the pattern with search() to r"\mypath{{path1}{path2}}" returns groups() as ("{path2}",).
Then I found an alternative way to do this:
>>> gpathRegexPat=r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
>>> gpathRegexCp=re.compile(gpathRegexPat)
>>> strpath=gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
>>> strpath
'{sadf}{ad}'
>>> p=re.compile(r'\{([^\{\}\[\]]*)\}')
>>> p.findall(strpath)
['sadf', 'ad']
or:
>>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
>>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
>>> strpath=gpathRegexCp.search(r'\input{{whatever]{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
>>> strpath
'\\mypath{{sadf}{ad}}'
>>> p.findall(strpath)
['sadf', 'ad']
At this point, I thought, why not just use the findall on the original string? I may use:
gpathRegexPat=r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})"
My reasoning: if the first (?:\{[^\{\}\[\]]*\})*? matches zero times and the second one matches once, the pattern captures sadf; if the first matches once and the second matches zero times, it captures ad. However, findall only returns ['sadf'] with this regex.
Without the extra patterns (?:\\mypath\{) and (?:\}), it actually works:
>>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
>>> p2.findall(strpath)
['sadf', 'ad']
>>> p2.findall('{adadd}{dfada}{adafadf}')
['adadd', 'dfada', 'adafadf']
Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?

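re.findall("{([^{}]+)}",text)
should work when text holds only the \mypath part; it then returns
['path1', 'path2', 'path3', 'pathn']
On the full string it would also pick up the items inside \input and \shape, so isolate the \mypath part first: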
my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
#get the \mypath part
my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
print(re.findall("{([^{}]+)}", my_path2))
or even better
re.findall("{(path\d+)}",text) #will only return things like path<num> inside {}

You are right. A repeated capturing group keeps only its last match, so it is not possible to return every repeated subgroup from a single group. To do what you want, you can use one regular expression to capture the whole group and then a second regular expression to capture the repeated subgroups.
In this case the first one would be something like: \\mypath\{((?:\{.*?\})+)\}. Its group 1 will hold {path1}{path2}{path3}
Then, to find the repeating {pathn} patterns inside that string, you can simply use \{(.*?)\}. This will match anything within the braces. The .*? is a non-greedy version of .*, meaning it will return the shortest possible match instead of the longest possible match.
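A minimal sketch of the two-step approach, assuming the outer pattern is written so that group 1 holds the whole run of brace groups. The sample string (with the literal "..." from the question left out) and the names outer and paths are illustrative only.

```python
import re

# Sample input modeled on the question (the literal "..." is omitted).
text = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}}\shape{{0.2}{0.3}}"

# Step 1: capture the whole run of {...} groups inside \mypath{...}.
outer = re.search(r"\\mypath\{((?:\{[^{}]*\})+)\}", text)

# Step 2: extract each brace-delimited item from that run.
paths = re.findall(r"\{([^{}]*)\}", outer.group(1)) if outer else []
print(paths)  # ['path1', 'path2', 'path3']
```

Using [^{}]* instead of .*? keeps each item from running across a brace, which is the same idea as the character class in the question.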

Related

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements, that contain numbers and elements, that end with dots.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile('[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string with a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Also, if you assume that elements which end with a dot might contain some numbers the complete solution should be:
reg = re.compile('[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
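As a rough illustration of how that pattern could be applied to the list from the question, here is a small sketch. The names drop and kept are made up for the example.

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt',
     'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church',
     'every', '10,000', 'persons']

# Matches when the element ends with a dot or contains a digit anywhere.
drop = re.compile(r"\.$|\d")

# Keep only the elements the pattern does not match.
kept = [s for s in a if not drop.search(s)]
print(kept)
```

search() is used rather than match() because the digit or the trailing dot can occur anywhere in the element, not only at the start.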
You could use:
(?:[^\d\n]*\d)|.*\.$
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

Python - Regex to avoid matching duplicates

My string looks like this:
bo_1
bo_1
bo_2
bo_2
bo_3
bo_3
bo_4
bo_4
bo_5
bo_5
bo_6
bo_6
bo_7
bo_7
bo_8
bo_8
bo_9
bo_9
bo_10
bo_10
I want to match the first instance of each digit and ignore the next duplicate line. My regex is as follows:
(bo_\d)(?![\s\S]*\1)
which returns the following:
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_1'
How would I modify the regex to return a result like this instead (to include 'bo_1' at the start and 'bo_10' at the end):
'bo_1'
'bo_2'
'bo_3'
'bo_4'
'bo_5'
'bo_6'
'bo_7'
'bo_8'
'bo_9'
'bo_10'
Technically you don't need regex for that (you can use set() for instance):
>>> # Assume your string is in the variable called "text"
>>> result = set(text.split('\n'))
>>> result
{'bo_7', 'bo_3', 'bo_1', 'bo_6', 'bo_5', 'bo_8', 'bo_9', 'bo_2', 'bo_4', 'bo_10'}
Anyway, the issue with your regex is that bo_1 is also matching bo_10, so it will be seen as a duplicate by the regex. You can solve it using word boundaries to ensure that the full 'word' is tested for a match:
\b(bo_\d+)\b(?![\s\S]*\b\1\b)
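A quick sketch of the word-boundary pattern in use, assuming the input is the twenty lines from the question. The names text, pattern and unique are illustrative.

```python
import re

# Rebuild the sample input: bo_1 .. bo_10, each line repeated twice.
text = "\n".join(f"bo_{i}" for i in range(1, 11) for _ in range(2))

# \b keeps bo_1 from being treated as a prefix of bo_10 in the lookahead.
pattern = re.compile(r"\b(bo_\d+)\b(?![\s\S]*\b\1\b)")

unique = pattern.findall(text)
print(unique)  # ['bo_1', 'bo_2', ..., 'bo_10']
```

Note that the negative lookahead rejects a line whenever the same value appears later, so what is kept is the last occurrence of each value rather than the first. The list of distinct values is the same either way.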
Use
(bo_\d+$)(?![\s\S]*^\1$)
Since you want to include bo_10, you should use \d+ and not just \d in the initial group. Then, in your negative lookahead, put the backreference between start-of-line and end-of-line anchors, so that, for example, bo_1 does not get excluded because it's followed by a bo_10. In Python, compile the pattern with re.MULTILINE so that ^ and $ match at line boundaries.
https://regex101.com/r/8khbcc/1
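The anchored pattern could be tried out like this. It is only a sketch: the sample text is rebuilt from the question, and the dict.fromkeys line is an alternative not taken from the answers, shown because it keeps the original order without any regex (dicts preserve insertion order in Python 3.7+).

```python
import re

# Rebuild the sample input: bo_1 .. bo_10, each line repeated twice.
text = "\n".join(f"bo_{i}" for i in range(1, 11) for _ in range(2))

# re.MULTILINE makes ^ and $ match at the start and end of every line.
by_regex = re.findall(r"(bo_\d+$)(?![\s\S]*^\1$)", text, flags=re.MULTILINE)

# Order-preserving de-duplication without a regex.
by_dict = list(dict.fromkeys(text.splitlines()))

print(by_regex)
print(by_dict)
```

Both lists hold bo_1 through bo_10 in order. Unlike set(), neither approach shuffles the values.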

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
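For illustration, here is how that call behaves on a made-up long string that also contains another http://www. address, as mentioned in the update. The sample text and the other URL are invented for the example.

```python
import re

# Hypothetical input: surrounding noise plus a second, unrelated URL.
long_string = ("some text http://www.example.org/page and more text "
               "http://www.someDomainName.com/2134, followed by the rest")

# The fixed domain name keeps the unrelated URL from matching.
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)  # ['http://www.someDomainName.com/2134']
```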
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, we have five groups:
Group 0 is the complete string that is matched.
The rest follow the order of the parentheses in the pattern, so the one you are looking for is the second one: (\\w*)
If you want, you can capture only the part of the string you are interested in: remove the parentheses from the parts of the pattern you don't need and keep just (\\w*)
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, you won't have groups 2, 3 and 4 as in the previous example, because we have captured only one group. Group 0 is still always available: it is the complete string that matches.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything from that index on.
This avoids regular expressions, which are harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])

Logic for finding and excluding multiple matches from a list

I need to match contents of a list with a given pattern, and form another list which will be having everything except the matches. Meaning, I am trying to make an exclude list.
Now with one pattern match, it is easy. But for more than one, it becomes tricky.
Let's see an example:
Lmain=[arc123, arc234,xyz111,xyz222,ppp999,ppp888]
for i in range(len(Lmain)):
    if Pattern matches Lmain[i]:
        pass
    else:
        result.append(Lmain[i])
Now let's say pattern = arc; my result will be
result = [xyz111,xyz222,ppp999,ppp888]
This is just the logic; I will be using regular expressions to find the matches.
Now if we have 2 patterns, then using the above logic in a loop:
Pattern=['arc','xyz']
for pat in Pattern:
    if pat matches Lmain[i]:
        pass
    else:
        result.append(Lmain[i])
This will give us the wrong result
result = [xyz111,xyz222,ppp999,ppp888,arc123,arc234,ppp999,ppp888]
So you can see the above logic just won't work.
My plan:
First we find exclude list for first Pattern which will give us result:
result = [xyz111,xyz222,ppp999,ppp888]
For the 2nd pattern, we need to look into the above result.
if Pattern matches Result[i]:
    pass
else:
    result_final.append(Result[i])
I think we need to use recursion to implement the above logic. Now how do we do that?
Also, we don't know the number of patterns the user is going to enter. It can be one or more.
If anybody has any ideas for the logic, please share.
Using a list comprehension and a generator expression, and skipping the intermediate step of building an exclude list and just building the final list:
>>> import re
>>> Lmain=['arc123', 'arc234', 'xyz111', 'xyz222','ppp999','ppp888']
>>> Pattern=['arc','xyz']
>>> [x for x in Lmain if not any(re.search(y, x) for y in Pattern)]
['ppp999', 'ppp888']
for item in lst:
    if all(pat not in item for pat in patterns):
        exclude_list.append(item)
Replace in with whatever is more appropriate in your case (e.g. item.startswith(pat)).
If there are more matches than non-matches, it should be more efficient to find the matches first, and then exclude them:
matches = [x for x in lst if any(x.startswith(p) for p in patterns)]
exclude_list = list(set(lst).difference(matches))
Yet another (and probably the fastest) option is to use regular expressions (here in combination with filter):
import re
expr = '^(?!%s)' % '|'.join(patterns)
exclude_list = list(filter(re.compile(expr).search, lst))  # list() is needed on Python 3, where filter returns an iterator
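Since the number of patterns is not known in advance, the combined-regex idea can be wrapped in a small helper. This is a sketch only: exclude_matches is a hypothetical name, and it assumes the patterns are literal prefixes (as with arc and xyz in the question), which is why they are passed through re.escape.

```python
import re

def exclude_matches(items, patterns):
    """Return the items that start with none of the given literal prefixes."""
    if not patterns:
        return list(items)  # nothing to exclude
    # One alternation covers any number of patterns; no recursion needed.
    combined = re.compile("|".join(re.escape(p) for p in patterns))
    return [item for item in items if not combined.match(item)]

Lmain = ['arc123', 'arc234', 'xyz111', 'xyz222', 'ppp999', 'ppp888']
print(exclude_matches(Lmain, ['arc']))         # ['xyz111', 'xyz222', 'ppp999', 'ppp888']
print(exclude_matches(Lmain, ['arc', 'xyz']))  # ['ppp999', 'ppp888']
```

If the user is supposed to enter real regular expressions rather than literal prefixes, drop the re.escape call and switch match to search as appropriate.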
matched = False
for pat in Pattern:
    if pat matches Lmain[i]:
        matched = True
        break
if matched:
    pass
else:
    result.append(Lmain[i])

How do I regex match with grouping with unknown number of groups

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:
...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
...
I would like to capture the list of numbers that occurs after the first incidence of the line that starts with VALUE. i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.
I tried to use this as a regex:
VALUE (?:(\d+)\s)+
This matches the line, but it only captures the last value, so I just get ('119',).
What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():
s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])
You can use a regular expression to see whether your input line matches your expected format (using the regex in your question), then you can run the above code without having to check for "VALUE" and knowing that the int(x) conversion will always succeed since you've already confirmed that the following character groups are all digits.
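Putting the validation and the split together might look like the sketch below. LINE_RE and first_value_numbers are hypothetical names, and the log text is a stand-in built from the lines shown in the question.

```python
import re

# Validates the whole line: the keyword followed by one or more integers.
LINE_RE = re.compile(r"VALUE(?:\s+\d+)+\s*$")

def first_value_numbers(log):
    """Return the numbers from the first well-formed VALUE line, or None."""
    for line in log.splitlines():
        if LINE_RE.match(line):
            # Safe to convert: the regex already confirmed these are digits.
            return [int(x) for x in line.split()[1:]]
    return None

log = "...\nVALUE 100 234 568 9233 119\n...\nVALUE 101 124 9223 4329 1559\n..."
print(first_value_numbers(log))  # [100, 234, 568, 9233, 119]
```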
>>> import re
>>> reg = re.compile('\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']
That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
Another option not described here is to have a bunch of optional capturing groups.
VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$
This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
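A short sketch of what the optional groups return when fewer than five numbers are present. The input line is made up, and unmatched groups come back as None, so they are filtered out.

```python
import re

pat = re.compile(r"VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$")

m = pat.match("VALUE 100 234 568")
print(m.groups())  # ('100', '234', '568', None, None)

# Drop the groups that did not participate in the match.
values = [g for g in m.groups() if g is not None]
print(values)  # ['100', '234', '568']
```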
You could just run your main match regex, then run a secondary regex on those matches to get the numbers:
matches = Regex.Match(log)
foreach (Match match in matches)
{
submatches = Regex2.Match(match)
}
This is of course also if you don't want to write a full parser.
I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:
VALUE((\s\d+)+)
This should result in three matches: [0] the whole line, [1] the stuff after value [2] the last space+value.
[0] and [2] can be ignored and then [1] can be used with the following:
\s(\d+)
Note: these regexps were not tested, I hope you get the idea though.
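For reference, the two patterns above can be checked in Python roughly as follows. This is a sketch on a single sample line; the variable names are illustrative.

```python
import re

line = "VALUE 100 234 568 9233 119"

# First pass: group 1 is everything after VALUE, group 2 only the last repetition.
outer = re.match(r"VALUE((\s\d+)+)", line)
print(outer.groups())  # (' 100 234 568 9233 119', ' 119')

# Second pass: parse the sub-groups out of group 1.
numbers = re.findall(r"\s(\d+)", outer.group(1))
print(numbers)  # ['100', '234', '568', '9233', '119']
```

Because \s also matches a newline, running the first pattern over a whole multi-line log could spill into a following line that starts with digits; matching line by line avoids that.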
The reason why Greg's answer doesn't work for me is because the 2nd part of the parsing is more complicated and not simply some numbers separated by a space.
However, I would honestly go with Greg's solution for this question (it's probably way more efficient).
I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.
You can use re.match to check first and call re.split to use a regex as separator to split.
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+"%(sep)) # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s): # OR re.match(r"VALUE(\s+\d+)+", s)
...     result = reg_sep.split(s)[1:] # OR re.split(r"\s+", s)[1:]
...
>>> result
['100', '234', '568', '9233', '119']
The separator "\s+" can be more complicated.
