How to add modifiers to regex in Python? - python

I want to add these two modifiers: g (global) and i (case-insensitive).
Here is a part of my script:
search_pattern = r"/somestring/gi"
pattern = re.compile(search_pattern)
queries = pattern.findall(content)
but it does not work.
The modifiers are as described on this page: https://regex101.com/

First of all, I suggest that you study the capabilities of regex101.com by checking all of its resources and sections. You can always see the Python (and PHP and JS) code by clicking the code generator link on the left.
How to obtain g global search behavior?
The findall in Python will get you matched texts in case you have no capturing groups. So, for r"somestring", findall is sufficient.
In case you have capturing groups, but you need the whole match, you can use finditer and access the .group(0).
How to obtain case-insensitive behavior?
The i modifier can be used as re.I or re.IGNORECASE modifiers in pattern = re.compile(search_pattern, re.I). Also, you can use inline flags: pattern = re.compile("(?i)somestring").
A word on regex delimiters
Instead of search_pattern = r"/somestring/gi" you should use search_pattern = r"somestring". This is due to the fact that flags are passed as a separate argument, and all actions are implemented as separate re methods.
So, you can use
import re
p = re.compile(r'somestring', re.IGNORECASE)
test_str = "somestring"
re.findall(p, test_str)
Or
import re
p = re.compile(r'(?i)(some)string')
test_str = "somestring"
print([(x.group(0),x.group(1)) for x in p.finditer(test_str)])
See IDEONE demo
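To see why the original pattern finds nothing in ordinary text, here is a small sketch. The sample content string is made up for illustration; it deliberately contains the literal text /somestring/gi so that the delimited pattern has something to match.

```python
import re

# Made-up sample text for illustration only.
content = "SomeString and somestring, plus the literal text /somestring/gi"

# With delimiters, the slashes and the trailing "gi" are ordinary characters
# that must appear literally in the input.
with_delimiters = re.findall(r"/somestring/gi", content)
print(with_delimiters)

# Without delimiters, and with the flag passed as a separate argument,
# findall returns every case-insensitive occurrence.
without_delimiters = re.findall(r"somestring", content, re.IGNORECASE)
print(without_delimiters)
```

The first call only finds the literal /somestring/gi text, while the second finds all three occurrences regardless of case.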

When I started using Python, I wondered the same thing. Unfortunately, Python does not provide special delimiters for creating a regex. At the end of the day, a regex is just a string, so you cannot specify modifiers along with the string as you can in JavaScript or Ruby.
Instead you need to compile the regex with modifiers.
regex = re.compile(r'something', re.IGNORECASE | re.DOTALL)
queries = regex.findall(content)
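As a rough illustration of what combining flags with | changes, the sketch below compares the same pattern compiled with and without re.DOTALL. The content string and the variable names are invented for the example.

```python
import re

# Made-up sample text: the first occurrence is split across a newline.
content = "Something\nelse, SOMETHING ELSE"

# Both flags: case is ignored and "." also matches a newline.
both_flags = re.compile(r"something.else", re.IGNORECASE | re.DOTALL)
# Only IGNORECASE: "." does not match the newline.
only_ignorecase = re.compile(r"something.else", re.IGNORECASE)

both_result = both_flags.findall(content)
ignorecase_result = only_ignorecase.findall(content)

print(both_result)
print(ignorecase_result)
```

With both flags the occurrence that spans the newline is found as well; with re.IGNORECASE alone only the single-line occurrence is returned.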

import re
search_pattern = r"somestring"
pattern = re.compile(search_pattern, flags=re.I)
print(pattern.findall("somestriNg"))
You can set flags this way. findall is global by default. Also, you don't need delimiters in Python.

Related

How to copy subsequent text after matching a pattern?

I have a text file with each line look something like this -
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
Each line has keyword testcaseid followed by some test case id (in this case blt12_0001 is the id and s3 and n4 are some parameters). I want to extract blt12_0001 from the above line. Each testcaseid will have exactly 1 underscore '_' in-between. What would be a regex for this case and how can I store name of test case id in a variable.
You could make use of capturing groups:
testcaseid_([^_]+_[^_]+)
See a demo on regex101.com.
One of many possible ways in Python could be
import re
line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
for m in re.finditer(r'testcaseid_([^_]+_[^_]+)', line):
    print(m.group(1))
See a demo on ideone.com.
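Since the question also asks how to store the id in a variable, here is one possible sketch using re.search. The helper name extract_test_case_id is hypothetical and only serves the example; it returns None when the line has no testcaseid_ part.

```python
import re

def extract_test_case_id(line):
    """Hypothetical helper: return the test case id, or None if absent."""
    match = re.search(r"testcaseid_([^_]+_[^_]+)", line)
    # group(1) is the text captured by the parentheses in the pattern.
    return match.group(1) if match else None

line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
test_case_id = extract_test_case_id(line)
missing = extract_test_case_id("GeneralBKT_n24_-e_dee")

print(test_case_id)
print(missing)
```

Checking for None before calling group() avoids an AttributeError on lines that do not contain the keyword.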
You can use this regex to capture your testcaseid given in your format,
(?<=testcaseid_)[^_]+_[^_]+
This essentially matches text that contains exactly one underscore and is preceded by testcaseid_, using a positive lookbehind. Here [^_]+ matches one or more characters other than an underscore, followed by _, and then [^_]+ again matches one or more characters other than _.
Check out this demo
Check out this Python code,
import re
lines = ['GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4', 'GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s6_n9']
for s in lines:
    m = re.search(r'(?<=testcaseid_)[^_]+_[^_]+', s)
    if m:
        print(m.group())
Output,
blt12_0001
blt12_0001
Another option that might work would be:
import re
expression = r"[^_\r\n]+_[^_\r\n]+(?=(?:_[a-z0-9]{2}){2}$)"
string = '''
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
GeneralBKT_n24_-e_dee_testcaseid_blt81_0023_s4_n5
'''
print(re.findall(expression, string, re.M))
Output
['blt12_0001', 'blt81_0023']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

How to substitute a regex with another regex in a string

This question showed how to replace a regex with another regex like this
$string = '"SIP/1037-00000014","SIP/CL-00000015","Dial","SIP/CL/61436523277,45"';
$pattern = '["SIP/CL/(\d*),(\d*)",]';
$replacement = '"SIP/CL/\1|\2",';
$string = preg_replace($pattern, $replacement, $string);
print($string);
However, I couldn't adapt that pattern to solve my case where I want to remove the full stop that lies between 2 words but not between a word and a number:
text = 'this . is bad. Not . 820'
regex1 = r'(\w+)(\s\.\s)(\D+)'
regex2 = r'(\w+)(\s)(\D+)'
re.sub(regex1, regex2, text)
# Desired outcome:
'this is bad. Not . 820'
Basically, I would like to remove the . between two alphabetic words. Could someone please help me with this problem? Thank you in advance.
These expressions might be close to what you might have in mind:
\s[.](?=\s\D)
or
(?<=\s)[.](?=\s\D)
Test
import re
regex = r"\s[.](?=\s\D)"
test_str = "this . is bad. Not . 820"
print(re.sub(regex, "", test_str))
Output
this is bad. Not . 820
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Firstly, you can't really take PHP and apply it directly to Python, for obvious reasons.
Secondly, it always helps to specify which version of Python you're using as APIs change. Luckily in this instance, the API of re.sub has remained the same between Python 2.x and Python 3.
Onto your issue.
The second argument to re.sub is either a replacement string or a function. If you pass in regex2, it is not applied as a regex: it is treated as a replacement template, so (\w+) and the rest would be inserted as literal text (and Python 3.7+ rejects unknown escapes such as \w in a replacement with an error).
If you want to use groups derived from the first regex (similar to your example, which is using \1 and \2 to extract the first and second matching group from the first regex), you can use the same backreferences (\1, \2) in the replacement string, or pass a function, which takes a match object as its sole argument, which you could then use to extract matching groups and return them as part of the replacement string.
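A minimal sketch of both options, applied to the text from the question. The pattern here is an adaptation made for this example (it captures the word before the dot and the first non-digit after it), not the asker's original regex1, and the function name rebuild is hypothetical.

```python
import re

text = 'this . is bad. Not . 820'

# Group 1: the word before " . "; group 2: the first non-digit after it.
pattern = re.compile(r'(\w+)\s\.\s(\D)')

# Option 1: a replacement string with backreferences to the groups.
with_template = pattern.sub(r'\1 \2', text)

# Option 2: a function that receives the match object and returns the replacement.
def rebuild(match):
    return match.group(1) + ' ' + match.group(2)

with_function = pattern.sub(rebuild, text)

print(with_template)
print(with_function)
```

Both calls produce the same string; the function form is mainly useful when the replacement needs logic that a template cannot express.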

RegEx for matching specific URLs

I'm trying to write a regex in Python that will match either a URL (for example https://www.foo.com/) or a domain that starts with "sc-domain:" but does not have https or a path.
For example, the below entries should pass
https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
However the below entries should fail
htps://www.foo.com/
https:/www.foo.com/bar/
sc-domain:www.foo.com/
sc-domain:www.foo.com/bar
scdomain:www.foo.com
Right now I'm working with the below:
^(https://*/|sc-domain:^[^/]*$)
This almost works, but still allows submissions like sc-domain:www.foo.com/ to go through. Specifically, the ^[^/]*$ part doesn't capture that a '/' should not pass.
^((?:https://\S+)|(?:sc-domain:[^/\s]+))$
You can try this.
See demo.
https://regex101.com/r/xXSayK/2
You can use this regex,
^(?:https?://www\.foo\.com(?:/\S*)*|sc-domain:www\.foo\.com)$
Explanation:
^ - Start of string
(?: - Start of non-capturing group for the alternation
https?://www\.foo\.com(?:/\S*)* - Matches a URL starting with http:// or https://, followed by www.foo.com, and optionally followed by a path, using (?:/\S*)*
| - Alternation, for strings starting with sc-domain:
sc-domain:www\.foo\.com - Matches sc-domain: followed by www.foo.com, and does not allow any path
)$ - Close of the non-capturing group and end of string.
Regex Demo
Also, I'm not sure whether you wanted to allow any domain, but in case you do, you can use this regex,
^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$
Regex Demo allowing any domain
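As a quick sanity check, the any-domain regex above can be run against the sample entries from the question. The list names should_pass and should_fail are just labels chosen for this sketch.

```python
import re

# The any-domain pattern from the answer above.
pattern = re.compile(
    r'^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$'
)

should_pass = [
    "https://www.foo.com/",
    "https://www.foo.com/bar/",
    "sc-domain:www.foo.com",
]
should_fail = [
    "htps://www.foo.com/",
    "https:/www.foo.com/bar/",
    "sc-domain:www.foo.com/",
    "sc-domain:www.foo.com/bar",
    "scdomain:www.foo.com",
]

# Entries the pattern accepts and rejects, respectively.
accepted = [s for s in should_pass if pattern.match(s)]
rejected = [s for s in should_fail if not pattern.match(s)]

print(accepted)
print(rejected)
```

Note that this pattern also accepts plain http:// URLs, which the question did not explicitly ask for; replace https? with https if that is not wanted.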
This expression would also do that, using a few simple capturing groups that you can modify as you wish. The alternation is wrapped in a non-capturing group so that ^ and $ apply to both branches:
^(?:((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$
I have also added http, which you can remove if it is undesired.
JavaScript Test
const regex = /^(((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$/gm;
const str = `https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
http://www.foo.com/
http://www.foo.com/bar/
`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Test with Python
You can simply test with Python and add the capturing groups that are desired:
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
# The outer (?:...) group makes ^ and $ apply to both alternatives.
regex = r"^(?:((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$"
test_str = ("https://www.foo.com/\n"
            "https://www.foo.com/bar/\n"
            "sc-domain:www.foo.com\n"
            "http://www.foo.com/\n"
            "http://www.foo.com/bar/\n\n"
            "htps://www.foo.com/\n"
            "https:/www.foo.com/bar/\n"
            "sc-domain:www.foo.com/\n"
            "sc-domain:www.foo.com/bar\n"
            "scdomain:www.foo.com")
# Python uses \1 and \2 (not $1 and $2) for backreferences in the replacement.
subst = r"\1 \2"
# You can manually limit the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Edit
Based on Pushpesh's advice, you can make the s optional with ? and simplify it to:
^(?:((https?)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$

Regex match single characters between strings

I have a string with some markup which I'm trying to parse, generally formatted like this.
'[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
I want to match the asterisks within the [list] tags so I can re.sub them as [**] but I'm having trouble forming an expression to grab them. So far, I have:
match = re.compile('\[list\].+?\[/list\]', re.DOTALL)
This gets everything within the list, but I can't figure out a way to narrow it down to the asterisks alone. Any advice would be massively appreciated.
You may use a re.sub and use a lambda in the replacement part. You pass the match to the lambda and use a mere .replace('*','**') on the match value.
Here is the sample code:
import re
s = '[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
match = re.compile(r'\[list].+?\[/list]', re.DOTALL)
print(match.sub(lambda m: m.group().replace('*', '**'), s))
# => [*]
# [list][**][**][/list][*]text[list][**][/list]
See the IDEONE demo
Note that a ] outside of a character class does not have to be escaped in Python re regex.

Parsing multi line comments from js using python

I want to get the contents of the multiline comments in a js file using python.
I tried this code sample
import re
code_m = """
/* This is a comment. */
"""
code_s = "/* This is a comment*/"
reg = re.compile("/\*(?P<contents>.*)\*/", re.DOTALL + re.M)
matches_m = reg.match(code_m)
matches_s = reg.match(code_s)
print matches_s # Give a match object
print matches_m # Gives None
I get matches_m as None. But matches_s works. What am I missing here?
match() only matches at the start of the string, use search() instead.
When using match(), it is like there is an implicit beginning of string anchor (\A) at the start of your regex.
As a side note, you don't need the re.M flag unless you are using ^ or $ in your regex and want them to match at the beginning and end of lines. You should also use a bitwise OR (re.S | re.M for example) instead of adding when combining multiple flags.
re.match only tests whether the regex matches at the beginning of the string. You're probably looking for re.search:
>>> reg.search(code_m)
<_sre.SRE_Match object at 0x7f293e94d648>
>>> reg.search(code_m).groups()
(' This is a comment. ',)
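The difference can be sketched side by side with the multiline sample from the question. The variable names at_start and anywhere are chosen for this example; re.M is left out because the pattern uses neither ^ nor $.

```python
import re

code_m = """
/* This is a comment. */
"""

# re.DOTALL lets "." match newlines inside a multi-line comment.
reg = re.compile(r"/\*(?P<contents>.*)\*/", re.DOTALL)

# match() only tries at position 0, where code_m has a newline, so it fails.
at_start = reg.match(code_m)

# search() scans forward until the pattern matches somewhere in the string.
anywhere = reg.search(code_m)

print(at_start)
print(anywhere.group("contents"))
```

If several flags are needed, combine them with a bitwise OR, for example re.DOTALL | re.MULTILINE, rather than with +.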
