python regex find text with digits - python

I have text like:
sometext...one=1290...sometext...two=12985...sometext...three=1233...
How can I find one=1290 and two=12985 but not three or four or five? There are can be from 4 to 5 digits after =. I tried this:
import re
pattern = r"(one|two)+=+(\d{4,5})+\D"
found = re.findall(pattern, sometext, flags=re.IGNORECASE)
print(found)
It gives me results like: [('one', '1290')].
If i use pattern = r"((one|two)+=+(\d{4,5})+\D)" it gives me [('one=1290', 'one', '1290')]. How can I get just one=1290?

You were close. You need to use a single capture group (or none for that matter):
((?:one|two)+=+\d{4,5})+
Full code:
import re
string = 'sometext...one=1290...sometext...two=12985...sometext...three=1233...'
pattern = r"((?:one|two)+=+\d{4,5})+"
found = re.findall(pattern, string, flags=re.IGNORECASE)
print(found)
# ['one=1290', 'two=12985']

Make the inner groups non capturing: ((?:one|two)+=+(?:\d{4,5})+\D)

The reason that you are getting results like [('one', '1290')] rather than one=1290 is because you are using capture groups. Use:
r"(?:one|two)=(?:\d{4,5})(?=\D)"
I have removed the additional + repeaters, as they were (I think?) unnecessary. You don't want to match things like oneonetwo===1234, right?
Using (?:...) rather than (...) defines a non-capture group. This prevents the result of the capture from being returned, and you instead get the whole match.
Similarly, using (?=\D) defines a look-ahead - so this is excluded from the match result.

Related

See which component in regex alternation was captured

In regex alternation, is there a way to retrieve which alternation was matched? I just need the first alternation match, not all the alternations that match.
For example, I have a regex like this
pattern = r'(abc.*def|mno.*pqr|mno.*pqrt|.....)'
string = 'mnoxxxpqrt'
I want the output to be 'mno.*pqr'
How should I write the regex statement? Python language is preferred.
To do this efficiently without any iterations, you can put your desired sub-patterns in a list and join them into one alternation pattern with each sub-pattern enclosed in a capture group (so the resulting pattern looks like (abc.*def)|(mno.*pqr) instead of (abc.*def|mno.*pqr)). You can then obtain the group number of the sub-pattern with the Match object's lastindex attribute and in turn obtain the matching sub-pattern from the original list of sub-patterns:
import re
patterns = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
pattern = '|'.join(map('({})'.format, patterns))
string = 'mno_foobar_pqrt'
print(pattern)
print(patterns[re.search(pattern, string).lastindex - 1])
This outputs:
(abc.*def)|(mno.*pqr)|(mno.*pqrt)
mno.*pqr
Demo: https://replit.com/#blhsing/JointBruisedMention
You can use capture groups:
import re
string = 'abcxxxdef'
patterns = ['abc.*def', 'mno.*pqr']
match = re.match(r'((abc.*def)|(mno.*pqr))',string)
groups = match.groups()
alternations = []
for i in range(1, len(groups)):
if (groups[i] != None):
pattern = patterns[i-1]
break
print(pattern)
Result: mno.*pqr
Expressions inside round brackets are capture groups, they correspond to the 1st to last index of the response. The 0th index is the whole match.
Then you would need to find the index which matched. Except your patterns would need to be fined before hand.
Well you could iterate the terms in the regex alternation:
string = 'abcxxxdef'
pattern = r'(abc.*def|mno.*pqr)'
terms = pattern[1:-1].split("|")
for term in terms:
if re.search(term, string):
print("MATCH => " + term)
This prints:
MATCH => abc.*def
The right answer to the question How should I write the regex statement? should actually be:
There is no known way to write the regex statement using the provided regex pattern which will allow to extract from the regex search result the information which of the alternatives have triggered the match.
And as there is no way to do it using the given pattern it is necessary to change the regex pattern which then makes it possible to extract from the match the requested information.
A possible way around this regex engine limitation is proposed below, but it requires an additional regex pattern search and has the disadvantage that there is a chance that it fails for some special search pattern alternatives.
The below provided code allows usage of simpler regex patterns without defining groups and works the "other way around" by checking which of the alternate patterns triggers a match in the found match for the entire regex:
import re
pattern = r'abc.*def|mno.*pqr|mno.*pqrt'
text = 'mnoxxxpqrt'
match = re.match(pattern,text)[0]
print(next(p for p in pattern.split('|') if re.match(p, match)))
It might fail in case when in the text found match string fails to be also a match for the single regex pattern what can happen for example if a non-word boundary \B requirement is used in the search pattern ( as mentioned in the comments by Kelly Bundy ).
A not failing alternative solution is to perform the regex search using a modified regex pattern. Below an approach using a dictionary for defining the alternatives and a function returning the matched group:
import re
dct_alts = {1:r'(abc.*def)',2:r'(mno.*pqr)',3:r'(mno.*pqrt)'}
# ^-- the dictionary index is the index of the matching group in the found match.
text = 'mnoxxxpqrt'
def get_matched_group(dct_alts, text):
pattern = '|'.join(dct_alts.values())
re_match = re.match(pattern, text)
return(dct_alts[re_match.lastindex])
print(get_matched_group(dct_alts, text))
prints
(mno.*pqr)
For the sake of completeness a function returning a list of all of the alternatives which give a match (not only the first one which matches):
import re
lst_alts = [r'abc.*def', r'mno.*pqr', r'mno.*pqrt']
text = 'mnoxxxpqrt'
def get_all_matched_groups(lst_alts, text):
matches = []
for pattern in lst_alts:
re_match = re.match(pattern, text)
if re_match:
matches.append(pattern)
return matches
print(get_all_matched_groups(lst_alts, text))
prints
['mno.*pqr', 'mno.*pqrt']

How to exclude some characters from the text matched group?

I am going to match two cases: 123456-78-9, or 123456789. My goal is to retrieve 123456789 from either case, ie to exclude the '-' from the first case, no need to mention that the second case is quite straightforward.
I have tried to use a regex like r"\b(\d+(?:-)?\d+(?:-)?\d)\b", but it still gives '123456-78-9' back to me.
what is the right regex I should use? Though I know do it in two steps: 1) get three parts of digits by regex 2) use another line to concat them, but I still prefer a regex so that the code is more elegant.
Thanks for any advices!
You can use r'(\d{6})(-?)(\d{2})\2(\d)'
Then Join groups 1, 3 and 4, or replace using "\\1\\3\\4"
Will only match these two inputs:
123456-78-9, or 123456789
It's up to you to put boundary conditions on it if needed.
https://regex101.com/r/ceB10E/1
You may put the numbers parts in capturing groups and then replace the entire match with just the captured groups.
Try something like:
\b(\d+)-?(\d+)-?(\d)\b
..and replace with:
\1\2\3
Note that the two non-capturing groups you're using are redundant. (?:-)? = -?.
Regex demo.
Python example:
import re
regex = r"\b(\d+)-?(\d+)-?(\d)\b"
test_str = ("123456-78-9\n"
"123456789")
subst = "\\1\\2\\3"
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
print (result)
Output:
123456789
123456789
Try it online.
The easiest thing to do here would be to first use re.sub to remove all non digit characters from the input. Then, use an equality comparison to check the input:
inp = "123456-78-9"
if re.sub(r'\D', '', inp) == '123456789':
print("MATCH")
Edit: If I misunderstood your problem, and instead the inputs could be anything, and you just want to match the two formats given, then use an alternation:
\b(?:\d{6}-\d{2}-\d|\d{9})\b
Script:
inp = "123456-78-9"
if re.search(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b', inp):
print("MATCH")

Python regex to find characters inside delimiters

I have a more challenging task, but first I am faced with this issue. Given a string s, I want to extract all the groups of characters marked by some delimiter, e.g. parentheses. How can I accomplish this using regular expressions (or any Pythonic way)?
import re
>>> s = '(3,1)-[(7,2),1,(a,b)]-8a'
>>> pattern = r'(\(.+\))'
>>> re.findall(pattern, s).group() # EDITED: findall vs. search
['(3,1)-[(7,2),1,(a,b)']
# Desire result
['(3,1)', '(7,2)', '(a,b)']
Use findall() instead of search(). The former finds all occurences, the latter only finds the first.
Use the non-greedy ? operator. Otherwise, you'll find a match starting at the first ( and ending at the final ).
Note that regular expressions aren't a good tool for finding nested expressions like: ((1,2),(3,4)).
import re
s = '(3,1)-[(7,2),1,(a,b)]-8a'
pattern = r'(\(.+?\))'
print re.findall(pattern, s)
Use re.findall()
import re
data = '(3,1)-[(7,2),1,(a,b)]-8a'
found = re.findall('(\(\w,\w\))', data)
print found
Output:
['(3,1)', '(7,2)', '(a,b)']

Python - Replace part of regex?

I have the following piece of code:
import re
s = "Example {String}"
replaced = re.sub('{.*?}', 'a', s)
print replaced
Which prints:
Example a
Is there a way to modify the regex so that it prints:
Example {a}
That is I would like to replace only the .* part and keep all the surrounding characters. However, I need the surrounding characters to define my regex. Is this possible?
Two ways to do this, either capture your pre- and post-fixes, or use lookbehinds and lookaheads.
# capture
s = "Example {String}"
replaced = re.sub(r'({).*?(})', r'\1a\2', s)
# lookaround
s = "Example {String}"
replaced = re.sub(r'(?<={).*?(?=})', r'a', s)
If you capture you pre- and post-fixes, you're able to use those capture groups in your substitution using \{capture_num}, e.g. \1 and \2.
If you use lookarounds, you essentially DON'T MATCH those pre-/post-fixes, you simply assert that they are there. Thus the substitution only swaps the letters in between.
Of course for this simple sub, the better solution is probably:
re.sub(r'{.*?}', r'{a}', s)

Regex to ensure group match doesn't end with a specific character

I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:
Name.Of.Show.S01E01
Name.Of.Show.0101
Name.Of.Show.01x01
Name.Of.Show.101
What I want to match is the show name. My main problem is that my regex matches the name of the show with a preceding '.'. My regex is the following:
"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
Some Examples:
>>> import re
>>> SHOW_INFO = re.compile("^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')
So the question is how do I avoid the first group ending with a period? I realize I could simply do:
var.strip(".")
However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?
Thanks in advance.
I think this will do:
>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:
>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')
So the only real restriction on the last group is that it doesn’t contain a dot? Easy:
^(.*?)(\.[^.]+)$
This matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches any non-dot character until the end of the string.
This works with all your test cases.
It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
I believe this will do what you want:
^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$
I tested this against the following list of shows:
30.Rock.S01E01
The.Office.0101
Lost.01x01
How.I.Met.Your.Mother.101
If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
If the last part never contains a dot: ^(.*)\.([^\.]+)$

Categories

Resources