Capture contents in regular expression - python

I have the following text:
text = 'itunes20140618.tbz'
How would I capture the date here, using a regular expression?
I am currently doing:
date = text.split('.tbz')[0].split('itunes')[-1]
I think using re.findall here would be cleaner for what I am trying to do. Please note that in the regular expression, the capture group needs to come after the specific word "itunes" (not merely after any non-digit characters).

You can use re.search to find your desired match.
>>> import re
>>> re.search(r'\d+', 'itunes20140618.tbz').group()
'20140618'
Since you state it has to be after the word itunes, you can use a capturing group and refer to that group number to access your match.
>>> import re
>>> re.search(r'itunes(\d+)', 'itunes20140618.tbz').group(1)
'20140618'
You can also use a positive lookbehind to ensure the match comes right after the word itunes.
>>> re.search(r'(?<=itunes)\d+', 'itunes20140618.tbz').group()
'20140618'
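Note that re.search returns None when nothing matches, so calling .group() directly raises AttributeError on unexpected input. A minimal sketch of a guarded version follows; the helper name extract_date and its None-on-failure convention are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical helper: the name and the None-on-failure behaviour are assumptions.
def extract_date(filename):
    """Return the digits that directly follow 'itunes', or None if absent."""
    match = re.search(r'itunes(\d+)', filename)
    return match.group(1) if match else None

print(extract_date('itunes20140618.tbz'))  # 20140618
print(extract_date('backup.tbz'))          # None
```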

Regex:
[^\d]*(\d+).*
If you guarantee that the expression is going to be of this form:
itunes followed by date, then you can also use this:
itunes(\d+).*
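For reference, here is a small sketch of how these two patterns could be applied in Python. It assumes re.match (which anchors at the start of the string) is acceptable, since both patterns are written to consume the whole file name; the variable names are illustrative.

```python
import re

text = 'itunes20140618.tbz'

# Generic form: skip any non-digits, then capture the first run of digits.
generic = re.match(r'[^\d]*(\d+).*', text)
# Stricter form: the digits must come right after the literal word "itunes".
strict = re.match(r'itunes(\d+).*', text)

print(generic.group(1))  # 20140618
print(strict.group(1))   # 20140618
```

The stricter form returns None for a name such as 'music20140618.tbz', while the generic form would still capture the digits.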

python regex match anything between string multiple conditions

I am trying to get my hands dirty on regex.
Sample String
Info:Somestring-103409115825.Call
Info: BIL*ONL*00003.Avbl
Currently This Matches
>>> print(re.search(r'Info:(.*?).Call', payload).group(1))
Somestring-103409115825
>>> print(re.search(r'Info:(.*?).Avbl', payload).group(1))
BIL*ONL*00003
How can I write one regex that matches both conditions, like Info -- AnyString -- .Call|.Avbl?
You can escape the dot and place it before an alternation that matches either Call or Avbl.
\bInfo:(.*?)\.(?:Call|Avbl)\b
import re
pattern = r"\bInfo:(.*?)\.(?:Call|Avbl)\b"
print(re.search(pattern, "Info:Somestring-103409115825.Call").group(1))
print(re.search(pattern, "Info: BIL*ONL*00003.Avbl").group(1))
Output
Somestring-103409115825
BIL*ONL*00003
If you don't want the leading space before the group value, you can use:
\bInfo:\s*(.*?)\.(?:Call|Avbl)\b
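As a quick check, the \s* variant can be run against both sample strings from the question. This is only a sketch; the payloads and values names are illustrative.

```python
import re

# Sample strings taken from the question.
payloads = ["Info:Somestring-103409115825.Call", "Info: BIL*ONL*00003.Avbl"]

# \s* after the colon swallows optional whitespace before the captured value.
pattern = re.compile(r"\bInfo:\s*(.*?)\.(?:Call|Avbl)\b")

values = [m.group(1) for m in map(pattern.search, payloads) if m]
print(values)  # ['Somestring-103409115825', 'BIL*ONL*00003']
```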
Please try this code.
import re
payload1 = 'Info:Somestring-103409115825.Call'
payload2 = 'Info: BIL*ONL*00003.Avbl'
# The dot is escaped so it only matches a literal "." before Call or Avbl.
print(re.search(r'Info:(.*?)(\.Call|\.Avbl)', payload1).group(1))
print(re.search(r'Info:(.*?)(\.Call|\.Avbl)', payload2).group(1))

Python matching dashes using Regular Expressions

I am currently new to Regular Expressions and would appreciate if someone can guide me through this.
import re
some = "I cannot take this B01234-56-K-9870 to the house of cards"
I have the above string and am trying to extract the substring with dashes (B01234-56-K-9870) using a Python regular expression. I have the following code so far:
regex = r'\w+-\w+-\w+-\w+'
match = re.search(regex, some)
print(match.group()) #returns B01234-56-K-9870
Is there any simpler way to extract the dash pattern using regular expression? For now, I do not care about the order or anything. I just wanted it to extract string with dashes.
Try the following regex (as shortened by The fourth bird),
\w+-\S+
Original regex: (?=\w+-)\S+
Explanation:
\w+- matches one or more word characters followed by a -
\S+ matches one or more non-whitespace characters
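One thing to be aware of: because \S+ accepts any non-whitespace character, punctuation directly after the code is included in the match. The sketch below compares it with a stricter alternative, \w+(?:-\w+)+, which is my own assumption and not part of the answer. The sample sentence has a comma added for illustration.

```python
import re

# Same sentence as in the question, with a comma added after the code.
some = "I cannot take this B01234-56-K-9870, to the house of cards"

loose = re.search(r'\w+-\S+', some).group()
strict = re.search(r'\w+(?:-\w+)+', some).group()

print(loose)   # B01234-56-K-9870,
print(strict)  # B01234-56-K-9870
```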

How to copy subsequent text after matching a pattern?

I have a text file with each line look something like this -
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
Each line has the keyword testcaseid followed by a test case id (in this case blt12_0001 is the id, and s3 and n4 are parameters). I want to extract blt12_0001 from the above line. Each test case id contains exactly one underscore '_'. What would be a regex for this case, and how can I store the test case id in a variable?
You could make use of capturing groups:
testcaseid_([^_]+_[^_]+)
One of many possible ways in Python could be
import re
line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
for match in re.finditer(r'testcaseid_([^_]+_[^_]+)', line):
    print(match.group(1))
You can use this regex to capture your testcaseid given in your format,
(?<=testcaseid_)[^_]+_[^_]+
This matches text that contains exactly one underscore and is preceded by testcaseid_, using a positive lookbehind. Here [^_]+ matches one or more characters other than an underscore, followed by _, and then [^_]+ again matches one or more characters other than _.
Check out this Python code,
import re
lines = ['GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4', 'GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s6_n9']
for s in lines:
    grp = re.search(r'(?<=testcaseid_)[^_]+_[^_]+', s)
    if grp:
        print(grp.group())
Output,
blt12_0001
blt12_0001
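Since the question also asks how to store the ids when reading a text file, here is a rough sketch that collects them into a list. The function name collect_ids is hypothetical, and io.StringIO merely stands in for an open file, so no real path is assumed.

```python
import io
import re

TESTCASE_RE = re.compile(r'testcaseid_([^_]+_[^_]+)')

def collect_ids(lines):
    """Return the test case id from every line that contains one."""
    ids = []
    for line in lines:
        match = TESTCASE_RE.search(line)
        if match:  # lines without the keyword are skipped
            ids.append(match.group(1))
    return ids

# io.StringIO stands in for an open text file here.
fake_file = io.StringIO(
    "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4\n"
    "a line without the keyword\n"
    "GeneralBKT_n24_-e_dee_testcaseid_blt81_0023_s4_n5\n"
)
testcase_ids = collect_ids(fake_file)
print(testcase_ids)  # ['blt12_0001', 'blt81_0023']
```

With a real file, the same function could be called on the file object returned by open(), since iterating over it yields lines.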
Another option that might work would be:
import re
expression = r"[^_\r\n]+_[^_\r\n]+(?=(?:_[a-z0-9]{2}){2}$)"
string = '''
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
GeneralBKT_n24_-e_dee_testcaseid_blt81_0023_s4_n5
'''
print(re.findall(expression, string, re.M))
Output
['blt12_0001', 'blt81_0023']

python regular expression replace only in parentheses

I would like to replace the ー to - in a regular expression like \d+(ー)\d+(ー)\d+. I tried re.sub but it will replace all the text including the numbers. Is it possible to replace the word in parentheses only?
e.g.
re.sub(r'\d+(ー)\d+(ー)\d+', '-', '4ー3ー1') should return '4-3-1'. Assume that a simple replace cannot be used because there are other ー characters that do not satisfy the regular expression. My current solution is to split the text and do the replacement on the parts that satisfy the regular expression.
You may use the Group Reference here.
import re
before = '4ー3ー1ーー4ー31'
after = re.sub(r'(\d+)ー(\d+)ー(\d+)', r'\1-\2-\3', before)
print(after) # '4-3-1ーー4ー31'
Here, r'\1' is a reference to the first group, i.e. the first pair of parentheses in the pattern.
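The same replacement can also be written with the \g<n> form of group reference, which stays unambiguous when a reference is immediately followed by a literal digit. A small sketch using the same input:

```python
import re

before = '4ー3ー1ーー4ー31'

# \g<1> is equivalent to \1 but cannot be confused with a following digit.
after = re.sub(r'(\d+)ー(\d+)ー(\d+)', r'\g<1>-\g<2>-\g<3>', before)
print(after)  # 4-3-1ーー4ー31
```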
You could use a function for the repl argument in re.sub to only touch the match groups.
import re
s = '1234ー2134ー5124'
re.sub(r"\d+(ー)\d+(ー)\d+", lambda x: x.group(0).replace('ー', '-'), s)
Using a slightly different pattern, you can take advantage of a lookahead, which does not consume the part of the string it matches. That is, a lookahead or lookbehind only asserts that the text at that position matches the lookaround expression; that text is not part of the overall match, so it is not replaced.
re.sub(r"ー(?=\d+)", "-", s)
If you can live with a fixed-length expression for the part preceding the ー, you can combine the lookahead with a lookbehind to make the regex a little more conservative.
re.sub(r"(?<=\d)ー(?=\d+)", "-", s)
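To make the difference between the two lookaround variants visible, here is a sketch on a made-up string that also contains a ー inside a word and a ー with no digit before it. The sample string and variable names are assumptions for illustration.

```python
import re

s = '1234ー2134ー5124 ラーメン ー5'

# Lookahead only: every ー followed by a digit is replaced,
# even when there is no digit in front of it.
lookahead_only = re.sub(r'ー(?=\d)', '-', s)

# Lookbehind + lookahead: only a ー sitting between two digits is replaced.
both_sides = re.sub(r'(?<=\d)ー(?=\d)', '-', s)

print(lookahead_only)  # 1234-2134-5124 ラーメン -5
print(both_sides)      # 1234-2134-5124 ラーメン ー5
```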
re.sub(r'\d+(ー)\d+(ー)\d+', '-', '4ー3ー1')
As you pointed out, this returns '-', because the entire match is replaced with '-' (re.sub takes the pattern first, then the replacement, then the string). To replace just the ー with -, you can use
import re
input_string = '4ー3ー1'
re.sub('ー', '-', input_string)
or you could find all the runs of digits and join them with '-':
'-'.join(re.findall(r'\d+', input_string))
Both methods should give you '4-3-1'.

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
Like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
Since the leading \s* is optional, you can remove it, along with the capture groups.
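The capture groups matter because re.findall returns the groups instead of the whole match whenever the pattern contains any. A short sketch comparing the two patterns (the sample line is made up):

```python
import re

line = "x1 = foo + bar42"

# With two capture groups, findall yields (group 1, group 2) tuples.
with_groups = re.findall(r"\s*([a-zA-Z]([a-zA-Z]|\d)*)", line)
# Without groups, findall yields the matched identifiers directly.
without_groups = re.findall(r"[a-zA-Z][a-zA-Z\d]*", line)

print(with_groups)     # [('x1', '1'), ('foo', 'o'), ('bar42', '2')]
print(without_groups)  # ['x1', 'foo', 'bar42']
```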
If you want to ensure that the match isn't preceded by a digit or a letter (so that it doesn't start in the middle of a token), you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
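The effect of the lookbehind shows up on input where a letter run follows a digit. A rough sketch with a made-up sample line:

```python
import re

line = "9lives abc x2"

plain = re.findall(r'[a-zA-Z][a-zA-Z\d]*', line)
guarded = re.findall(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', line, re.I)

print(plain)    # ['lives', 'abc', 'x2']
print(guarded)  # ['abc', 'x2']
```

The plain pattern picks up 'lives' out of '9lives', while the guarded one skips it because the candidate match is preceded by a digit.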
