How to copy subsequent text after matching a pattern?

I have a text file in which each line looks something like this:
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
Each line has the keyword testcaseid followed by a test case id (in this case blt12_0001 is the id, and s3 and n4 are parameters). I want to extract blt12_0001 from the line above. Each test case id contains exactly one underscore '_'. What regex would match this, and how can I store the test case id in a variable?

You could make use of capturing groups:
testcaseid_([^_]+_[^_]+)
See a demo on regex101.com.
One of many possible ways in Python could be
import re
line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
# group(1) holds the text captured after "testcaseid_"
for match in re.finditer(r'testcaseid_([^_]+_[^_]+)', line):
    print(match.group(1))
See a demo on ideone.com.

You can use this regex to match the test case id in your format:
(?<=testcaseid_)[^_]+_[^_]+
This matches text that contains exactly one underscore and is preceded by testcaseid_, using a positive lookbehind. Here [^_]+ matches one or more characters other than an underscore, followed by _, and then [^_]+ again matches one or more characters other than _. Because the lookbehind does not consume any text, the whole match is the id and no capturing group is needed.
Check out this Python code,
import re
lines = ['GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4', 'GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s6_n9']
for s in lines:
    # The lookbehind keeps "testcaseid_" out of the match, so group() is the id itself
    match = re.search(r'(?<=testcaseid_)[^_]+_[^_]+', s)
    if match:
        print(match.group())
Output,
blt12_0001
blt12_0001
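To address the second part of the question (storing the id in a variable), here is a minimal sketch built on the capturing-group pattern above. The helper name extract_testcase_id and the constant TESTCASE_ID are illustrative, not part of the original answers, and the sketch assumes each line contains at most one testcaseid_ marker.

```python
import re

# Pattern from the answers above: the id is two underscore-free runs joined by "_"
TESTCASE_ID = re.compile(r'testcaseid_([^_]+_[^_]+)')

def extract_testcase_id(line):
    """Return the id that follows 'testcaseid_', or None if the line has none."""
    match = TESTCASE_ID.search(line)
    return match.group(1) if match else None

line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
testcase_id = extract_testcase_id(line)  # the id is now stored in a variable
print(testcase_id)
```

Returning None for lines without a match lets the caller decide whether to skip such lines or report them.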

Another option that might work would be:
import re
expression = r"[^_\r\n]+_[^_\r\n]+(?=(?:_[a-z0-9]{2}){2}$)"
string = '''
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
GeneralBKT_n24_-e_dee_testcaseid_blt81_0023_s4_n5
'''
print(re.findall(expression, string, re.M))
Output
['blt12_0001', 'blt81_0023']

Related

Single regular expression for extracting different values

I have some inputs like
ID= 5657A
ID=PID=FSGDVD
IDS=5645SD
I have created a regex, i.e. IDS=[A-Za-z0-9]+|ID=[A-Za-z0-9]+|PID=[A-Za-z0-9]+. But in the case of ID=PID=FSGDVD, I want PID=FSGDVD as the output.
My outputs must look like
ID= 5657A
PID=FSGDVD
IDS=5645SD
How should I approach this problem?

Add an end-of-line anchor and use grouping and quantifiers to simplify the regex:
(?:IDS?|PID)=[A-Za-z0-9]+$
IDS? matches both ID and IDS
(?:IDS?|PID) matches ID, IDS, or PID
(?:pattern) is a non-capturing group. Some functions, such as re.split and re.findall, change their behavior when capture groups are present, so a non-capturing group is the better choice whenever backreferences aren't needed
$ is the end-of-line anchor, so the match is taken at the end of the line rather than at the start
Demo: https://regex101.com/r/e9uvmC/1
In case your input can be something like ID=PID=FSGDVD xyz, you could use lookarounds:
(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)
Here \b ensures that the match extends to the end of the run of word characters after the = sign, and (?!=) is a negative lookahead assertion that rejects the match if an = follows
Demo: https://regex101.com/r/e9uvmC/2
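As a rough check, the two patterns above can be tried in Python as follows. The helper first_match and the constant names are illustrative only. Note that neither pattern as written allows whitespace after the = sign, so the sample input ID= 5657A from the question does not match; a \s* after the = would be one way to permit it, though that is an assumption about the intended input.

```python
import re

# End-anchored pattern and lookahead variant from the answer above
ANCHORED = re.compile(r'(?:IDS?|PID)=[A-Za-z0-9]+$')
LOOKAHEAD = re.compile(r'(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)')

def first_match(pattern, line):
    """Return the first match of pattern in line, or None if there is none."""
    m = pattern.search(line)
    return m.group() if m else None

for line in ['ID=PID=FSGDVD', 'IDS=5645SD', 'ID= 5657A']:
    print(repr(line), '->', first_match(ANCHORED, line))

# The lookahead variant tolerates trailing text after the value
print(first_match(LOOKAHEAD, 'ID=PID=FSGDVD xyz'))
```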

Another one could be
[A-Z]+=\s*[^=]+$
See a demo on regex101.com.
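A small sketch of how this pattern behaves on the three sample inputs, applied one line at a time. The names PATTERN, inputs, and results are illustrative. Because of the \s* after the = sign, this pattern also accepts the input with a space, and [A-Z]+ assumes the key is written in uppercase letters.

```python
import re

# Pattern from the answer above: key, "=", optional whitespace, then a value without "="
PATTERN = re.compile(r'[A-Z]+=\s*[^=]+$')

inputs = ['ID= 5657A', 'ID=PID=FSGDVD', 'IDS=5645SD']
results = []
for line in inputs:
    m = PATTERN.search(line)
    results.append(m.group() if m else None)

print(results)
```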

How to substitute a regex with another regex in a string

This question showed how to replace a regex with another regex like this
$string = '"SIP/1037-00000014","SIP/CL-00000015","Dial","SIP/CL/61436523277,45"';
$pattern = '["SIP/CL/(\d*),(\d*)",]';
$replacement = '"SIP/CL/\1|\2",';
$string = preg_replace($pattern, $replacement, $string);
print($string);
However, I couldn't adapt that pattern to solve my case where I want to remove the full stop that lies between 2 words but not between a word and a number:
text = 'this . is bad. Not . 820'
regex1 = r'(\w+)(\s\.\s)(\D+)'
regex2 = r'(\w+)(\s)(\D+)'
re.sub(regex1, regex2, text)
# Desired outcome:
'this is bad. Not . 820'
Basically, I would like to remove the . between two alphabetic words. Could someone please help me with this problem? Thank you in advance.

These expressions might be close to what you have in mind:
\s[.](?=\s\D)
or
(?<=\s)[.](?=\s\D)
Test
import re
regex = r"\s[.](?=\s\D)"
test_str = "this . is bad. Not . 820"
print(re.sub(regex, "", test_str))
Output
this is bad. Not . 820

Firstly, you can't really take PHP and apply it directly to Python, for obvious reasons.
Secondly, it always helps to specify which version of Python you're using as APIs change. Luckily in this instance, the API of re.sub has remained the same between Python 2.x and Python 3.
On to your issue.
The second argument to re.sub is either a string or a function. If you pass in regex2, it is used as replacement text; re.sub won't apply regex2 as a regex. (On Python 3.7 and later, escapes such as \w in a replacement string raise an error.)
If you want to use groups derived from the first regex (similar to your example, which uses \1 and \2 to refer to the first and second matching groups), you can use the same backreferences in a Python replacement string, or pass a function that takes a match object as its sole argument, extracts the matching groups, and returns the replacement string.
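A minimal sketch of both forms of replacement, applied to the text from the question. The pattern (\w+)\s\.\s(\D) and the function name drop_dot are illustrative choices, not taken from the question or the answers.

```python
import re

text = 'this . is bad. Not . 820'
# Group 1: the word before " . "; group 2: the first non-digit character after it
pattern = re.compile(r'(\w+)\s\.\s(\D)')

# Replacement string with backreferences to the captured groups
with_backrefs = pattern.sub(r'\1 \2', text)

def drop_dot(match):
    """Rebuild the matched text without the full stop."""
    return match.group(1) + ' ' + match.group(2)

# Replacement function that receives a match object
with_function = pattern.sub(drop_dot, text)

print(with_backrefs)
print(with_function)
```

Both calls leave the " . " before 820 untouched, because \D does not match a digit.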

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/. I am aware that the demo uses \author and not \maketitle.
In Python, I try the following:
import re
# A raw triple-quoted string keeps the backslashes and line breaks as written
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect this has something to do with my pattern and that the SRE engine fails.
EDIT: Is there something wrong with my regex? Can it be improved so that it actually works? I still get the same results on my machine.
EDIT 2: Can this be fixed so that the pattern supports an optional "followed by", using ?: or ?= lookaheads, so that both can be captured?

After reading the section "Parentheses Create Numbered Capturing Groups" on https://www.regular-expressions.info/brackets.html, I found the answer:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
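The quoted Set(Value)? example can be tried in Python with a short sketch like the one below. The variable names are illustrative. In Python's re module, an optional group that did not take part in the match is reported as None rather than as an empty string.

```python
import re

pattern = re.compile(r'Set(Value)?')

groups = {}
for s in ['Set', 'SetValue']:
    m = pattern.fullmatch(s)
    # group(0) is the whole match; group(1) is the optional capturing group
    groups[s] = (m.group(0), m.group(1))
    print(s, '->', groups[s])
```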

Regex match single characters between strings

I have a string with some markup which I'm trying to parse, generally formatted like this.
'[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
I want to match the asterisks within the [list] tags so I can re.sub them as [**] but I'm having trouble forming an expression to grab them. So far, I have:
match = re.compile('\[list\].+?\[/list\]', re.DOTALL)
This gets everything within the list, but I can't figure out a way to narrow it down to the asterisks alone. Any advice would be massively appreciated.

You may use a re.sub and use a lambda in the replacement part. You pass the match to the lambda and use a mere .replace('*','**') on the match value.
Here is the sample code:
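import re
s = '[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
# Raw string so that \[ reaches the regex engine as an escaped bracket
match = re.compile(r'\[list].+?\[/list]', re.DOTALL)
# Double every asterisk inside each [list]...[/list] span
print(match.sub(lambda m: m.group().replace('*', '**'), s))
# => [*]
# [list][**][**][/list][*]text[list][**][/list]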
See the IDEONE demo
Note that a ] outside of a character class does not have to be escaped in Python re regex.

Capture contents in regular expression

I have the following text:
text = 'itunes20140618.tbz'
How would I capture the date here, using a regular expression?
I am currently doing:
date = text.split('.tbz')[0].split('itunes')[-1]
I think using re.findall here would be cleaner for what I am trying to do. Please note that the capture group needs to come after the specific word "itunes" (not just match any run of digits).

You can use re.search to find your desired match.
>>> import re
>>> re.search(r'\d+', 'itunes20140618.tbz').group()
'20140618'
Since you state it has to be after the word itunes, you can use a capturing group and refer to that group number to access your match.
>>> import re
>>> re.search(r'itunes(\d+)', 'itunes20140618.tbz').group(1)
'20140618'
You can also use a positive lookbehind to ensure the match comes after the word itunes.
>>> re.search(r'(?<=itunes)\d+', 'itunes20140618.tbz').group()
'20140618'

Regex:
[^\d]*(\d+).*
If you can guarantee that the input always has the form itunes followed by a date, then you can also use this:
itunes(\d+).*
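Since the question asks about re.findall specifically, here is a small sketch using the capture-group pattern from the answers. The variable names are illustrative. With exactly one capturing group in the pattern, re.findall returns a list of that group's matches.

```python
import re

filename = 'itunes20140618.tbz'

# One capturing group, so findall returns the captured digits only
dates = re.findall(r'itunes(\d+)', filename)
date = dates[0] if dates else None
print(date)

# The more general pattern from this answer, anchored at the start with re.match
m = re.match(r'[^\d]*(\d+).*', filename)
print(m.group(1) if m else None)
```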
