RegEx for matching specific URLs

RegEx for matching specific URLs - python

I'm trying to write a regex in Python that will match either a URL (for example https://www.foo.com/) or a domain that starts with "sc-domain:" but does not have https or a path.
For example, the below entries should pass
https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
However the below entries should fail
htps://www.foo.com/
https:/www.foo.com/bar/
sc-domain:www.foo.com/
sc-domain:www.foo.com/bar
scdomain:www.foo.com
Right now I'm working with the below:
^(https://*/|sc-domain:^[^/]*$)
This almost works, but still allows submissions like sc-domain:www.foo.com/ to go through. Specifically, the ^[^/]*$ part doesn't capture that a '/' should not pass.

^((?:https://\S+)|(?:sc-domain:[^/\s]+))$
You can try this.
See demo.
https://regex101.com/r/xXSayK/2
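A quick way to check a candidate pattern against the examples from the question is a small Python harness. This is only a sketch: the helper name is_valid_entry is made up for illustration, and re.fullmatch is used so that the whole string has to match (with re.match, the trailing $ would also accept a string that ends in a newline).

```python
import re

# Pattern copied from the answer above.
ENTRY_RE = re.compile(r"^((?:https://\S+)|(?:sc-domain:[^/\s]+))$")


def is_valid_entry(entry):
    """Return True only if the entire entry matches the pattern."""
    return ENTRY_RE.fullmatch(entry) is not None


# Sample entries taken from the question.
should_pass = [
    "https://www.foo.com/",
    "https://www.foo.com/bar/",
    "sc-domain:www.foo.com",
]
should_fail = [
    "htps://www.foo.com/",
    "https:/www.foo.com/bar/",
    "sc-domain:www.foo.com/",
    "sc-domain:www.foo.com/bar",
    "scdomain:www.foo.com",
]

for entry in should_pass + should_fail:
    print(f"{entry!r}: {is_valid_entry(entry)}")
```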

You can use this regex,
^(?:https?://www\.foo\.com(?:/\S*)*|sc-domain:www\.foo\.com)$
Explanation:
^ - Start of line
(?: - Start of non-capturing group for the alternation
https?://www\.foo\.com(?:/\S*)* - This matches a URL starting with http:// or https:// followed by www.foo.com and further optionally followed by a path using (?:/\S*)*
| - alternation for strings starting with sc-domain:
sc-domain:www\.foo\.com - This part starts matching with sc-domain: followed by www.foo.com and further does not allow any file path
)$ - Close of non-grouping pattern and end of string.
Regex Demo
Also, I am not sure whether you wanted to allow any domain, but in case you do, you can use this regex,
^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$
Regex Demo allowing any domain

This expression would also do that, using simple capturing groups that you can modify as you wish:
^(?:((http|https)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$
I have also added http, which you can remove if it is undesired.
JavaScript Test
const regex = /^(((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$/gm;
const str = `https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
http://www.foo.com/
http://www.foo.com/bar/
`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Test with Python
You can test this in Python and add whichever capturing groups you need:
import re
# The alternation is wrapped in one group so that ^ and $ apply to both branches.
regex = r"^(?:((http|https)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$"
test_str = ("https://www.foo.com/\n"
            "https://www.foo.com/bar/\n"
            "sc-domain:www.foo.com\n"
            "http://www.foo.com/\n"
            "http://www.foo.com/bar/\n\n"
            "htps://www.foo.com/\n"
            "https:/www.foo.com/bar/\n"
            "sc-domain:www.foo.com/\n"
            "sc-domain:www.foo.com/bar\n"
            "scdomain:www.foo.com")
# Python refers to groups as \1, \2 in the replacement (not $1, $2).
subst = r"\1 \2"
# count=0 replaces every match; re.MULTILINE makes ^ and $ match at each line.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Edit
Based on Pushpesh's advice, you can use the optional quantifier (s?) and simplify it to:
^(?:((https?)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$
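When an alternation is anchored, it matters where the anchors sit: in a pattern of the form ^A|B$ the ^ belongs only to A and the $ only to B, so grouping the alternation is what makes both anchors apply to each branch. The following Python sketch illustrates the difference; the names loose and strict and the sample strings are made up for this illustration.

```python
import re

# ^ binds only to the first alternative and $ only to the second.
loose = re.compile(r"^https://www\.foo\.com/\S*|sc-domain:www\.foo\.com$")
# Grouping the alternation makes both anchors apply to either branch.
strict = re.compile(r"^(?:https://www\.foo\.com/\S*|sc-domain:www\.foo\.com)$")

samples = [
    "https://www.foo.com/bar/",
    "sc-domain:www.foo.com",
    "https://www.foo.com/ trailing junk",
    "junk sc-domain:www.foo.com",
]

for s in samples:
    print(f"{s!r}: loose={bool(loose.search(s))} strict={bool(strict.search(s))}")
```

The first two samples are accepted by both patterns; the last two are accepted only by the loosely anchored one.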

Related

Extracting words between two strings in ProperCase and Line breaks [duplicate]

For example, this regex
(.*)<FooBar>
will match:
abcde<FooBar>
But how do I get it to match across multiple lines?
abcde
fghij<FooBar>

Try this:
((.|\n)*)<FooBar>
It basically says "any character or a newline" repeated zero or more times.

It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:
/(.*)<FooBar>/s
The s at the end causes the dot to match all characters including newlines.

The question is, can the . pattern match any character? The answer varies from engine to engine. The main difference is whether the pattern is used by a POSIX or non-POSIX regex library.
A special note about lua-patterns: they are not considered regular expressions, but . matches any character there, the same as POSIX-based engines.
Another note on matlab and octave: the . matches any character by default (demo): str = "abcde\n fghij<Foobar>"; expression = '(.*)<Foobar>*'; [tokens,matches] = regexp(str,expression,'tokens','match'); (tokens contains an abcde\n fghij item).
Also, in all of boost's regex grammars the dot matches line breaks by default. Boost's ECMAScript grammar allows you to turn this off with regex_constants::no_mod_m (source).
As for oracle (it is POSIX based), use the n option (demo): select regexp_substr('abcde' || chr(10) ||' fghij<Foobar>', '(.*)<Foobar>', 1, 1, 'n', 1) as results from dual
POSIX-based engines:
A mere . already matches line breaks, so there isn't a need to use any modifiers, see bash (demo).
The tcl (demo), postgresql (demo), r (TRE, base R default engine with no perl=TRUE, for base R with perl=TRUE or for stringr/stringi patterns, use the (?s) inline modifier) (demo) also treat . the same way.
However, most POSIX-based tools process input line by line. Hence, . does not match the line breaks just because they are not in scope. Here are some examples how to override this:
sed - There are multiple workarounds. The most precise, but not very safe, is sed 'H;1h;$!d;x; s/\(.*\)><Foobar>/\1/' (H;1h;$!d;x; slurps the file into memory). If whole lines must be included, sed '/start_pattern/,/end_pattern/d' file (removing from start will end with matched lines included) or sed '/start_pattern/,/end_pattern/{{//!d;};}' file (with matching lines excluded) can be considered.
perl - perl -0pe 's/(.*)<FooBar>/$1/gs' <<< "$str" (-0 slurps the whole file into memory, -p prints the file after applying the script given by -e). Note that using -000pe will slurp the file and activate 'paragraph mode' where Perl uses consecutive newlines (\n\n) as the record separator.
gnu-grep - grep -Poz '(?si)abc\K.*?(?=<Foobar>)' file. Here, z enables file slurping, (?s) enables the DOTALL mode for the . pattern, (?i) enables case insensitive mode, \K omits the text matched so far, *? is a lazy quantifier, (?=<Foobar>) matches the location before <Foobar>.
pcregrep - pcregrep -Mi "(?si)abc\K.*?(?=<Foobar>)" file (M enables file slurping here). Note pcregrep is a good solution for macOS grep users.
See demos.
Non-POSIX-based engines:
php - Use the s (PCRE_DOTALL) modifier: preg_match('~(.*)<Foobar>~s', $s, $m) (demo)
c# - Use RegexOptions.Singleline flag (demo): - var result = Regex.Match(s, @"(.*)<Foobar>", RegexOptions.Singleline).Groups[1].Value; - var result = Regex.Match(s, @"(?s)(.*)<Foobar>").Groups[1].Value;
powershell - Use the (?s) inline option: $s = "abcde`nfghij<FooBar>"; $s -match "(?s)(.*)<Foobar>"; $matches[1]
perl - Use the s modifier (or (?s) inline version at the start) (demo): /(.*)<FooBar>/s
python - Use the re.DOTALL (or re.S) flags or (?s) inline modifier (demo): m = re.search(r"(.*)<FooBar>", s, flags=re.S) (and then if m:, print(m.group(1)))
java - Use Pattern.DOTALL modifier (or inline (?s) flag) (demo): Pattern.compile("(.*)<FooBar>", Pattern.DOTALL)
kotlin - Use RegexOption.DOT_MATCHES_ALL : "(.*)<FooBar>".toRegex(RegexOption.DOT_MATCHES_ALL)
groovy - Use (?s) in-pattern modifier (demo): regex = /(?s)(.*)<FooBar>/
scala - Use (?s) modifier (demo): "(?s)(.*)<Foobar>".r.findAllIn("abcde\n fghij<Foobar>").matchData foreach { m => println(m.group(1)) }
javascript - Use [^] or workarounds [\d\D] / [\w\W] / [\s\S] (demo): s.match(/([\s\S]*)<FooBar>/)[1]
c++ (std::regex) Use [\s\S] or the JavaScript workarounds (demo): regex rex(R"(([\s\S]*)<FooBar>)");
vba vbscript - Use the same approach as in JavaScript, ([\s\S]*)<Foobar>. (NOTE: The MultiLine property of the RegExp object is sometimes erroneously thought to be the option to allow . match across line breaks, while, in fact, it only changes the ^ and $ behavior to match start/end of lines rather than strings, the same as in JavaScript regex behavior.)
ruby - Use the /m MULTILINE modifier (demo): s[/(.*)<Foobar>/m, 1]
r (base R, PCRE) - Base R PCRE regexps - use (?s): regmatches(x, regexec("(?s)(.*)<FooBar>",x, perl=TRUE))[[1]][2] (demo)
r (stringr/stringi, ICU) - In stringr/stringi regex functions, which are powered by the ICU regex engine, also use (?s): stringr::str_match(x, "(?s)(.*)<FooBar>")[,2] (demo)
go - Use the inline modifier (?s) at the start (demo): re := regexp.MustCompile(`(?s)(.*)<FooBar>`)
swift - Use dotMatchesLineSeparators or (easier) pass the (?s) inline modifier to the pattern: let rx = "(?s)(.*)<Foobar>"
objective-c - The same as Swift. (?s) works the easiest, but here is how the option can be used: NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionDotMatchesLineSeparators error:&regexError];
re2, google-apps-script - Use the (?s) modifier (demo): "(?s)(.*)<Foobar>" (in Google Spreadsheets, =REGEXEXTRACT(A2,"(?s)(.*)<Foobar>"))
NOTES ON (?s):
In most non-POSIX engines, the (?s) inline modifier (or embedded flag option) can be used to enforce . to match line breaks.
If placed at the start of the pattern, (?s) changes the behavior of all . in the pattern. If the (?s) is placed somewhere after the beginning, only those .s will be affected that are located to the right of it unless this is a pattern passed to Python's re. In Python re, regardless of the (?s) location, the whole pattern . is affected (and since Python 3.11, a global inline flag that is not at the start of the pattern raises an error). The (?s) effect is stopped using (?-s). A modified group can be used to only affect a specified range of a regex pattern (e.g., Delim1(?s:.*?)\nDelim2.* will make the first .*? match across newlines and the second .* will only match the rest of the line).
POSIX note:
In non-POSIX regex engines, to match any character, [\s\S] / [\d\D] / [\w\W] constructs can be used.
In POSIX, [\s\S] is not matching any character (as in JavaScript or any non-POSIX engine), because regex escape sequences are not supported inside bracket expressions. [\s\S] is parsed as bracket expressions that match a single character, \ or s or S.
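For Python specifically, the options listed above can be compared side by side. This is a minimal sketch using only the standard re module; the variable names and the Delim1/Delim2 sample text are made up for illustration, and the scoped (?s:...) group assumes Python 3.6 or later.

```python
import re

s = "abcde\nfghij<FooBar>"

# Default: "." stops at the line break, so only the last line is captured.
default = re.search(r"(.*)<FooBar>", s).group(1)

# Three equivalent ways to let the match cross line breaks.
with_flag = re.search(r"(.*)<FooBar>", s, flags=re.DOTALL).group(1)
with_inline = re.search(r"(?s)(.*)<FooBar>", s).group(1)
with_class = re.search(r"([\s\S]*)<FooBar>", s).group(1)

# Scoped modifier: only the first ".*?" may cross line breaks.
text = "Delim1 a\nb\nDelim2 rest\nnext line"
scoped = re.search(r"Delim1(?s:.*?)\nDelim2.*", text).group(0)

print(repr(default))
print(repr(with_flag))
print(repr(scoped))
```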

If you're using Eclipse search, you can enable the "DOTALL" option to make '.' match any character including line delimiters: just add "(?s)" at the beginning of your search string. Example:
(?s).*<FooBar>

In many regex dialects, /[\S\s]*<Foobar>/ will do just what you want. Source

([\s\S]*)<FooBar>
The dot matches all except newlines (\r\n). So use \s\S, which will match ALL characters.

We can also use
(.*?\n)*?
to match everything including newline without being greedy.
This will make the new line optional
(.*?|\n)*?

In Ruby you can use the 'm' option (multiline):
/YOUR_REGEXP/m
See the Regexp documentation on ruby-doc.org for more information.

"." normally doesn't match line-breaks. Most regex engines allow you to add the S-flag (also called DOTALL and SINGLELINE) to make "." also match newlines.
If that fails, you could do something like [\S\s].

For Eclipse, the following expression worked:
Foo
jadajada Bar"
Regular expression:
Foo[\S\s]{1,10}.*Bar*

Note that (.|\n)* can be less efficient than (for example) [\s\S]* (if your language's regexes support such escapes) and than finding how to specify the modifier that makes . also match newlines. Or you can go with POSIXy alternatives like [[:space:][:^space:]]*.
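To see whether this matters in a given setup, the variants can be timed against each other. The sketch below uses Python's re and timeit; the sample text and repetition count are arbitrary assumptions, and the timings depend on the engine and the input, so no particular result is implied.

```python
import re
import timeit

# Arbitrary multi-line sample text ending in the marker.
text = "some line of text\n" * 200 + "tail<FooBar>"

patterns = {
    "alternation (.|\\n)*": re.compile(r"((?:.|\n)*)<FooBar>"),
    "class [\\s\\S]*": re.compile(r"([\s\S]*)<FooBar>"),
    "DOTALL .*": re.compile(r"(.*)<FooBar>", re.DOTALL),
}


def captured(pattern):
    """Return the text captured before the marker."""
    return pattern.search(text).group(1)


for name, pattern in patterns.items():
    seconds = timeit.timeit(lambda p=pattern: p.search(text), number=200)
    print(f"{name}: {seconds:.4f}s for 200 searches")
```

All three patterns capture the same text here; only the time taken may differ.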

Use:
/(.*)<FooBar>/s
The s causes dot (.) to match line breaks (including carriage returns) as well.

Use RegexOptions.Singleline. It changes the meaning of . to include newlines.
Regex.Replace(content, searchText, replaceText, RegexOptions.Singleline);

In notepad++ you can use this
<table (.|\r\n)*</table>
It will match the entire table starting from
rows and columns
You can make it non-greedy, using the following, that way it will match the first, second and so forth tables separately and not all at once
<table (.|\r\n)*?</table>
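The difference between the greedy and the non-greedy form can be reproduced outside Notepad++ as well. The following Python sketch uses re.DOTALL instead of the (.|\r\n) alternation; the HTML snippet is a made-up sample.

```python
import re

html = (
    "<table id='a'><tr><td>1</td></tr>\n</table>\n"
    "<p>text</p>\n"
    "<table id='b'><tr><td>2</td></tr>\n</table>"
)

# Greedy: runs from the first <table to the LAST </table>.
greedy = re.findall(r"<table .*</table>", html, flags=re.DOTALL)
# Non-greedy: stops at the first </table> after each <table.
lazy = re.findall(r"<table .*?</table>", html, flags=re.DOTALL)

print(len(greedy), "greedy match(es)")
print(len(lazy), "non-greedy match(es)")
```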

In a Java-based regular expression, you can use [\s\S].

This works for me and is the simplest one:
(\X*)<FooBar>

Generally, . doesn't match newlines, so try ((.|\n)*)<foobar>.

In JavaScript you can use [^]* to search for zero to infinite characters, including line breaks.
$("#find_and_replace").click(function() {
  var text = $("#textarea").val();
  var search_term = new RegExp("[^]*<Foobar>", "gi");
  var replace_term = "Replacement term";
  var new_text = text.replace(search_term, replace_term);
  $("#textarea").val(new_text);
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<button id="find_and_replace">Find and replace</button>
<br>
<textarea ID="textarea">abcde
fghij<Foobar></textarea>

Solution:
Using the pattern modifiers sU gives the desired matching in PHP.
Example:
preg_match('/(.*)/sU', $content, $match);
Sources:
Pattern Modifiers

In the context of use within languages, regular expressions act on strings, not lines. So you should be able to use the regex normally, assuming that the input string has multiple lines.
In this case, the given regex will match the entire string, since "<FooBar>" is present. Depending on the specifics of the regex implementation, the $1 value (obtained from the "(.*)") will either be "fghij" or "abcde\nfghij". As others have said, some implementations allow you to control whether the "." will match the newline, giving you the choice.
Line-based regular expression use is usually for command line things like egrep.

Try: .*\n*.*<FooBar> assuming you are also allowing blank newlines. As you are allowing any character including nothing before <FooBar>.

I had the same problem and solved it in probably not the best way but it works. I replaced all line breaks before I did my real match:
mystring = Regex.Replace(mystring, "\r\n", "")
I am manipulating HTML so line breaks don't really matter to me in this case.
I tried all of the suggestions above with no luck. I am using .NET 3.5 FYI.

I wanted to match a particular if block in Java:
...
...
if(isTrue){
doAction();
}
...
...
}
If I use the regExp
if \(isTrue(.|\n)*}
it included the closing brace for the method block, so I used
if \(!isTrue([^}.]|\n)*}
to exclude the closing brace from the wildcard match.

Often we have to modify a substring with a few keywords spread across lines preceding the substring. Consider an XML element:
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>81</PercentComplete>
</TASK>
Suppose we want to modify the 81 to some other value, say 40. First identify <UID>21</UID>, then skip all characters including \n up to <PercentComplete>. The regular expression pattern and the replace specification are:
String hw = new String("<TASK>\n <UID>21</UID>\n <Name>Architectural design</Name>\n <PercentComplete>81</PercentComplete>\n</TASK>");
String pattern = new String ("(<UID>21</UID>)((.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
String replaceSpec = new String ("$1$2$440$6");
// Note that the group (<PercentComplete>) is $4 and the group ((.|\n)*?) is $2.
String iw = hw.replaceFirst(pattern, replaceSpec);
System.out.println(iw);
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>40</PercentComplete>
</TASK>
The subgroup (.|\n) is probably the missing group $3. If we make it non-capturing by (?:.|\n) then the $3 is (<PercentComplete>). So the pattern and replaceSpec can also be:
pattern = new String("(<UID>21</UID>)((?:.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
replaceSpec = new String("$1$2$340$5");
and the replacement works correctly as before.
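The same replacement can be sketched in Python (a different language from the Java answer above, shown only as an analogue). The two-task sample document is made up for illustration. The replacement uses \g<1>40 rather than \140, because digits written directly after a numeric backreference would not be read as group 1 followed by the literal 40.

```python
import re

xml = (
    "<TASK>\n <UID>20</UID>\n <PercentComplete>55</PercentComplete>\n</TASK>\n"
    "<TASK>\n <UID>21</UID>\n <Name>Architectural design</Name>\n"
    " <PercentComplete>81</PercentComplete>\n</TASK>"
)

# Group 1: everything from the UID up to the opening tag (DOTALL lets ".*?" cross lines).
# Group 2: the closing tag. The digits in between are the part being replaced.
pattern = re.compile(
    r"(<UID>21</UID>.*?<PercentComplete>)\d+(</PercentComplete>)",
    re.DOTALL,
)

updated = pattern.sub(r"\g<1>40\g<2>", xml, count=1)
print(updated)
```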

Typically searching for three consecutive lines in PowerShell, it would look like:
$file = Get-Content file.txt -raw
$pattern = 'lineone\r\nlinetwo\r\nlinethree\r\n' # "Windows" text
$pattern = 'lineone\nlinetwo\nlinethree\n' # "Unix" text
$pattern = 'lineone\r?\nlinetwo\r?\nlinethree\r?\n' # Both
$file -match $pattern
# output
True
Bizarrely, this would be Unix text at the prompt, but Windows text in a file:
$pattern = 'lineone
linetwo
linethree
'
Here's a way to print out the line endings:
'lineone
linetwo
linethree
' -replace "`r",'\r' -replace "`n",'\n'
# Output
lineone\nlinetwo\nlinethree\n
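The \r?\n idea carries over to other regex engines. Here is a minimal Python sketch of the same check; the helper name show_line_endings is made up for illustration. Note that Python's open() translates line endings when reading in text mode by default, so a file would need to be opened with newline="" for any \r to survive.

```python
import re

# \r?\n accepts both Windows (\r\n) and Unix (\n) line endings.
pattern = re.compile(r"lineone\r?\nlinetwo\r?\nlinethree\r?\n")

windows_text = "lineone\r\nlinetwo\r\nlinethree\r\n"
unix_text = "lineone\nlinetwo\nlinethree\n"


def show_line_endings(text):
    """Make line endings visible by replacing them with literal \\r and \\n."""
    return text.replace("\r", "\\r").replace("\n", "\\n")


print(bool(pattern.search(windows_text)), show_line_endings(windows_text))
print(bool(pattern.search(unix_text)), show_line_endings(unix_text))
```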

Option 1
One way would be to use the s flag (just like the accepted answer):
/(.*)<FooBar>/s
Demo 1
Option 2
A second way would be to use the m (multiline) flag and any of the following patterns:
/([\s\S]*)<FooBar>/m
or
/([\d\D]*)<FooBar>/m
or
/([\w\W]*)<FooBar>/m
Demo 2

How to catch a string using regex in python and replace it by desired string

I am new to Python and wrote the following code, which is supposed to find a specific string and replace it with another.
sid=\"1722407313768658\"
I used this regex: sid=(.+?)
but it also catches irrelevant strings such as
https://tmobile.demdex.net/dest5.html?d_nsid=0#
Also, when I run this regex on sid=\"1722407313768658\" (replacing it with 1900117189066752), the string is not replaced; the new value is added in front of it instead: sid=\1900117189066752\ "1722407313768658\"
(Instead of 1722407313768658 I want to have 1900117189066752.)
This is my Python code:
import re
content = c.read()
################################################################
# change sessionid in content
replace_small_sid = str('sid=\\' + "\\"+str(sid) + "\\" + " ")
content = re.sub("sid=(.+?)", replace_small_sid, content)

As I understand it you wish to match string patterns in the form:
sid=\"1722407313768658\"
With the aim of replacing the digits.
To achieve this we can use positive lookbehinds and lookaheads as described here:
https://www.regular-expressions.info/lookaround.html
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
In this case our lookbehind will match
sid=\"
Our lookahead will match
\"
Please see the example here: https://regex101.com/r/2pXcMI/2
Finally, we can use this to match and replace as follows:
import re
line = "sid=\"1722407313768658\" safklabsf ipashf oiasfoi asbg fasnk sid=\"65641\" asjobfaosb asbfaosb asf asfauv sid=\"651564165\"."
replace_with = '1900117189066752'
line = re.sub(r'(?<=sid=")\d+(?=")', replace_with, line)
line
This returns
'sid="1900117189066752" safklabsf ipashf oiasfoi asbg fasnk sid="1900117189066752" asjobfaosb asbfaosb asf asfauv sid="1900117189066752".'
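Two details of the original attempt are worth spelling out. A lazy (.+?) at the very end of a pattern stops after a single character, and a bare sid= also matches inside d_nsid=. The sketch below is one possible variant, not the approach from the answer above: it uses capturing groups instead of lookarounds so that an optional literal backslash before each quote can be allowed (Python's lookbehind must be fixed-width). The helper name replace_sid and the sample content are assumptions made for illustration.

```python
import re

# \b keeps "sid=" from matching inside "d_nsid=".
# \\? allows an optional literal backslash before each double quote.
SID_RE = re.compile(r'(\bsid=\\?")\d+(\\?")')


def replace_sid(content, sid):
    """Replace the digits of every sid="..." (or sid=\\"...\\") value with sid."""
    # A function replacement avoids ambiguity between group numbers and digits.
    return SID_RE.sub(lambda m: m.group(1) + sid + m.group(2), content)


content = (
    'https://tmobile.demdex.net/dest5.html?d_nsid=0# '
    'sid=\\"1722407313768658\\" sid="65641"'
)
print(replace_sid(content, "1900117189066752"))
```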

Since you want to replace a specific string, you can do it with:
content = content.replace("1722407313768658","1900117189066752")

How to add modifiers to regex in python?

I want to add these two modifiers: g - global , i - insensitive
here is a part of my script:
search_pattern = r"/somestring/gi"
pattern = re.compile(search_pattern)
queries = pattern.findall(content)
but it does not work.
The modifiers are as shown on this page: https://regex101.com/

First of all, I suggest that you should study regex101.com capabilities by checking all of its resources and sections. You can always see Python (and PHP and JS) code when clicking code generator link on the left.
How to obtain g global search behavior?
The findall in Python will get you matched texts in case you have no capturing groups. So, for r"somestring", findall is sufficient.
In case you have capturing groups, but you need the whole match, you can use finditer and access the .group(0).
How to obtain case-insensitive behavior?
The i modifier can be used as re.I or re.IGNORECASE modifiers in pattern = re.compile(search_pattern, re.I). Also, you can use inline flags: pattern = re.compile("(?i)somestring").
A word on regex delimiters
Instead of search_pattern = r"/somestring/gi" you should use search_pattern = r"somestring". This is due to the fact that flags are passed as a separate argument, and all actions are implemented as separate re methods.
So, you can use
import re
p = re.compile(r'somestring', re.IGNORECASE)
test_str = "somestring"
re.findall(p, test_str)
Or
import re
p = re.compile(r'(?i)(some)string')
test_str = "somestring"
print([(x.group(0),x.group(1)) for x in p.finditer(test_str)])
See IDEONE demo

When I started using Python, I wondered the same. Unfortunately, Python does not provide special delimiters for creating regexes. At the end of the day, a regex is just a string, so you cannot specify modifiers along with the string, unlike in JavaScript or Ruby.
Instead you need to compile the regex with modifiers.
regex = re.compile(r'something', re.IGNORECASE | re.DOTALL)
queries = regex.findall(content)

import re
search_pattern = r"somestring"
pattern = re.compile(search_pattern, flags=re.I)  # flags are passed as a separate argument
print(pattern.findall("somestriNg"))
You can set flags this way. findall is global by default. Also, you don't need delimiters in Python.
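If patterns arrive in the /pattern/flags notation (for example, copied from regex101), they can be translated before compiling. The helper below, compile_js_style, is hypothetical and not part of the re module; it handles only the i, m and s flags and ignores g, since findall and finditer already return every match.

```python
import re

# Mapping from single-letter modifiers to Python's re flags.
_FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def compile_js_style(literal):
    """Compile a '/body/flags' literal as a Python pattern (hypothetical helper)."""
    if not literal.startswith("/") or literal.count("/") < 2:
        raise ValueError(f"not a /pattern/flags literal: {literal!r}")
    # Everything between the first and the last slash is the pattern body.
    body, _, flag_chars = literal[1:].rpartition("/")
    flags = 0
    for ch in flag_chars:
        if ch == "g":
            continue  # no Python equivalent needed
        if ch not in _FLAG_MAP:
            raise ValueError(f"unsupported flag: {ch!r}")
        flags |= _FLAG_MAP[ch]
    return re.compile(body, flags)


pattern = compile_js_style("/somestring/gi")
print(pattern.findall("SomeString and somestring"))
```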

Parsing multi line comments from js using python

I want to get the contents of the multiline comments in a js file using python.
I tried this code sample
import re
code_m = """
/* This is a comment. */
"""
code_s = "/* This is a comment*/"
reg = re.compile(r"/\*(?P<contents>.*)\*/", re.DOTALL + re.M)
matches_m = reg.match(code_m)
matches_s = reg.match(code_s)
print(matches_s) # Gives a match object
print(matches_m) # Gives None
I get matches_m as None. But matches_s works. What am I missing here?

match() only matches at the start of the string, use search() instead.
When using match(), it is like there is an implicit beginning of string anchor (\A) at the start of your regex.
As a side note, you don't need the re.M flag unless you are using ^ or $ in your regex and want them to match at the beginning and end of lines. You should also use a bitwise OR (re.S | re.M for example) instead of adding when combining multiple flags.

re.match tests to see if the string matches the regex. You're probably looking for re.search:
>>> reg.search(code_m)
<_sre.SRE_Match object at 0x7f293e94d648>
>>> reg.search(code_m).groups()
(' This is a comment. ',)
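The difference between match() and search(), and how to collect several comments at once, can be sketched as follows. This variant swaps the greedy .* from the question for a lazy .*? so that each comment is matched separately; the sample code string and the name comment_re are made up for illustration.

```python
import re

code = """
/* This is a comment. */
var x = 1; /* second
comment */
"""

# Lazy .*? stops at the first closing */; DOTALL lets it span lines.
comment_re = re.compile(r"/\*(?P<contents>.*?)\*/", re.DOTALL)

# match() only tries position 0, where the string has a newline.
at_start = comment_re.match(code)
# search() scans forward until it finds the first comment.
first = comment_re.search(code)
# finditer() yields every comment in the string.
all_contents = [m.group("contents") for m in comment_re.finditer(code)]

print(at_start)
print(repr(first.group("contents")))
print(all_contents)
```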
