I want to get the contents of the multiline comments in a js file using python.
I tried this code sample
import re
code_m = """
/* This is a comment. */
"""
code_s = "/* This is a comment*/"
reg = re.compile("/\*(?P<contents>.*)\*/", re.DOTALL + re.M)
matches_m = reg.match(code_m)
matches_s = reg.match(code_s)
print matches_s # Gives a match object
print matches_m # Gives None
I get matches_m as None. But matches_s works. What am I missing here?
match() only matches at the start of the string, use search() instead.
When using match(), it is like there is an implicit beginning of string anchor (\A) at the start of your regex.
As a side note, you don't need the re.M flag unless you are using ^ or $ in your regex and want them to match at the beginning and end of lines. You should also use a bitwise OR (re.S | re.M for example) instead of adding when combining multiple flags.
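A minimal sketch of the difference, reusing the pattern and sample strings from the question (written for Python 3; the names match_s, match_m and search_m are only for illustration):

```python
import re

code_m = """
/* This is a comment. */
"""
code_s = "/* This is a comment*/"

# Flags are combined with bitwise OR, not +.
# re.M is harmless here but unnecessary, since the pattern has no ^ or $.
reg = re.compile(r"/\*(?P<contents>.*)\*/", re.S | re.M)

# match() only succeeds if the pattern matches at position 0.
match_s = reg.match(code_s)    # string starts with "/*", so this matches
match_m = reg.match(code_m)    # string starts with a newline, so this is None

# search() scans forward and matches anywhere in the string.
search_m = reg.search(code_m)

print(match_s is not None)
print(match_m)
print(repr(search_m.group("contents")))
```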
re.match only tests whether the regex matches at the beginning of the string. You're probably looking for re.search:
>>> reg.search(code_m)
<_sre.SRE_Match object at 0x7f293e94d648>
>>> reg.search(code_m).groups()
(' This is a comment. ',)
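If the file contains more than one block comment, the greedy .* in the original pattern would run from the first /* to the last */. A possible sketch using a lazy quantifier with findall follows; the js_source sample is made up for illustration, and the approach assumes comment markers never appear inside string literals:

```python
import re

# Hypothetical JavaScript source with two block comments.
js_source = """
/* first
   comment */
var x = 1; /* second comment */
"""

# .*? is lazy, so each match stops at the nearest closing */
# instead of running on to the last one in the file.
block_comment = re.compile(r"/\*(.*?)\*/", re.DOTALL)

# findall returns the contents of the single capturing group for every match.
comments = [c.strip() for c in block_comment.findall(js_source)]
print(comments)
```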
Related
I'm trying to write a regex in Python that will match either a URL (for example https://www.foo.com/) or a domain that starts with "sc-domain:" but does not have https or a path.
For example, the below entries should pass
https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
However the below entries should fail
htps://www.foo.com/
https:/www.foo.com/bar/
sc-domain:www.foo.com/
sc-domain:www.foo.com/bar
scdomain:www.foo.com
Right now I'm working with the below:
^(https://*/|sc-domain:^[^/]*$)
This almost works, but still allows submissions like sc-domain:www.foo.com/ to go through. Specifically, the ^[^/]*$ part doesn't capture that a '/' should not pass.
^((?:https://\S+)|(?:sc-domain:[^/\s]+))$
You can try this.
See demo.
https://regex101.com/r/xXSayK/2
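A quick way to check this pattern against the entries from the question is a small script like the one below (a sketch; the list names are only for illustration). Note that \S+ accepts anything non-blank after https://, so this pattern checks the shape of the entry rather than the validity of the host:

```python
import re

pattern = re.compile(r"^((?:https://\S+)|(?:sc-domain:[^/\s]+))$")

should_pass = [
    "https://www.foo.com/",
    "https://www.foo.com/bar/",
    "sc-domain:www.foo.com",
]
should_fail = [
    "htps://www.foo.com/",
    "https:/www.foo.com/bar/",
    "sc-domain:www.foo.com/",
    "sc-domain:www.foo.com/bar",
    "scdomain:www.foo.com",
]

# Collect every entry the pattern accepts; ideally this equals should_pass.
accepted = [s for s in should_pass + should_fail if pattern.match(s)]
print(accepted)
```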
You can use this regex,
^(?:https?://www\.foo\.com(?:/\S*)*|sc-domain:www\.foo\.com)$
Explanation:
^ - Start of line
(?: - Start of non-capturing group for the alternation
https?://www\.foo\.com(?:/\S*)* - This matches a URL starting with http:// or https://, followed by www.foo.com, optionally followed by a path via (?:/\S*)*
| - Alternation, for strings starting with sc-domain:
sc-domain:www\.foo\.com - This part matches sc-domain: followed by www.foo.com and does not allow any path after it
)$ - Close of the non-capturing group and end of string.
Also, I'm not sure whether you wanted to allow any domain, but in case you do, you can use this regex,
^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$
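As a rough check of the any-domain pattern, the sketch below wraps it in a helper (is_valid is a hypothetical name) and runs it over the entries from the question. One detail worth knowing: \w does not match a hyphen, so hyphenated host names are rejected by the pattern as written.

```python
import re

any_domain = re.compile(
    r"^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$"
)

def is_valid(entry):
    """Return True if the entry is an accepted URL or sc-domain entry."""
    return any_domain.match(entry) is not None

accepted = ["https://www.foo.com/", "https://www.foo.com/bar/", "sc-domain:www.foo.com"]
rejected = ["htps://www.foo.com/", "https:/www.foo.com/bar/", "sc-domain:www.foo.com/",
            "sc-domain:www.foo.com/bar", "scdomain:www.foo.com"]

print(all(is_valid(e) for e in accepted))
print(any(is_valid(e) for e in rejected))

# \w does not include "-", so a hyphenated host is rejected as written.
print(is_valid("https://my-site.example/"))
```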
This expression would also do that, using capturing groups that you can modify as you wish. The outer group makes ^ and $ apply to both alternatives, and the dots are escaped so they match only a literal dot:
^(((http|https)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$
I have also allowed http, which you can remove if it is undesired.
JavaScript Test
const regex = /^(((http|https)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$/gm;
const str = `https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
http://www.foo.com/
http://www.foo.com/bar/
`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Test with Python
You can simply test with Python and add the capturing groups that are desired:
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^(((http|https)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$"
test_str = ("https://www.foo.com/\n"
"https://www.foo.com/bar/\n"
"sc-domain:www.foo.com\n"
"http://www.foo.com/\n"
"http://www.foo.com/bar/\n\n"
"htps://www.foo.com/\n"
"https:/www.foo.com/bar/\n"
"sc-domain:www.foo.com/\n"
"sc-domain:www.foo.com/bar\n"
"scdomain:www.foo.com")
# Python's re.sub refers to groups as \1, \2, ... rather than $1, $2
subst = r"\1"
# You can limit the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Edit
Based on Pushpesh's advice, you can make the s optional with https? and simplify it to:
^(((https?)(:\/\/www\.foo\.com)(\/.*))|(sc-domain:www\.foo\.com))$
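One thing to watch with patterns of this shape is that alternation has the lowest precedence, so without an enclosing group ^ binds only to the first alternative and $ only to the second. A small sketch of the effect, using simplified patterns that are assumed here purely for illustration:

```python
import re

# Simplified patterns, assumed for illustration only.
ungrouped = re.compile(r"^https?://www\.foo\.com/.*|sc-domain:www\.foo\.com$")
grouped = re.compile(r"^(?:https?://www\.foo\.com/.*|sc-domain:www\.foo\.com)$")

candidate = "xx sc-domain:www.foo.com"

# Ungrouped: the second alternative is not anchored at the start,
# so the junk prefix is accepted.
print(bool(ungrouped.search(candidate)))

# Grouped: both anchors apply to both alternatives, so the entry is rejected.
print(bool(grouped.search(candidate)))
```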
I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes] some quotes look like this
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he
treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quotes:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
As you requested this for learning purpose, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern, which also keeps the quotation marks out of the result.
Inside the quotes we lazily match anything, including line breaks, with [\s\S]+?
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
Arguably, we do not need any regex flags here. In the online demo, the g (global) flag is added to get multiple matches; in Python, use re.findall() or re.finditer() for that. The m (multiline) flag, re.MULTILINE in Python, makes the anchors ^ and $ match at the start and end of each line instead of the whole string. It does not let the dot match newlines, which is why [\s\S] (or re.DOTALL) is needed to get results that span lines.
Therefore, putting the $ anchor in the middle of the pattern, before the closing quote, is just wrong.
One more thing, I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
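For a version that runs without BeautifulSoup, here is a sketch with hard-coded strings standing in for str(q). The raw_quotes list is made up to resemble the output in the question, and collapsing the whitespace inside a wrapped quote is an extra step assumed here, not something the question asked for:

```python
import re

# Hypothetical stand-ins for str(q); the real items come from BeautifulSoup.
raw_quotes = [
    "“Don't cry because it's over, smile because it happened.”\n  ―\n\n",
    "“If you want to know what a man's like, take a good look at how he\n"
    "treats his inferiors, not his equals.”\n  ―\n  ,\n\n",
    "no quotation marks here",
]

regex = re.compile(r"(?<=“)[\s\S]+?(?=”)")

cleaned_quotes = []
for raw in raw_quotes:
    m = regex.search(raw)
    if m:  # skip entries without a quoted passage instead of raising
        # Collapse the line break inside a wrapped quote into a single space.
        cleaned_quotes.append(" ".join(m.group().split()))

print(cleaned_quotes)
```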
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ comes before the closing ”, so nothing can match: no character can follow the end of the line.
To make $ match at the end of each line in a multiline string, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not look for a matching part elsewhere in it.
That means re.search should do what you initially expected to accomplish.
So the resulting regex could be:
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)
I'm currently writing an application that uses a framework to match certain phrases. Currently it is supposed to match the following regex pattern:
Say \"(.*)\"
However, I've noticed that my users are complaining that their OS sometimes pastes in 'curly quotes'. What ends up happening is that users provide the following sentences:
Say "Hello world!" <-- Matches
Say “Hello world!” <-- Doesn't match!
Is there any way I can tell Python's regular expressions to treat these curly quotes the same as regular quotes?
Edit:
Turns out you can very easily tell Python to read your regular expression as a unicode string. I changed my code to the following and it worked:
u'Say (?:["“”])(.*)(?:["“”])'
# (?:["“”]) <-- Start a non-capturing group, match one of the three possible quote types, and do not return it
# (.*) <-- Start a capture group, match anything and return it
# (?:["“”]) <-- Match the closing quote, again without returning it
You could just include the curly quotes in the regex:
Say [\"“”](.*)[\"“”]
As something you can replicate in the Python REPL (shown for Python 3, where str is Unicode; in Python 2, prefix the pattern and the test strings with u so the curly quotes are not treated as separate bytes), it's like this:
>>> import re
>>> test_str = r'"Hello"'
>>> reg = r'["“”](.*)["“”]'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'Hello'
>>> test_str = r'“Hello world!”'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'Hello world!'
As an alternative to Kyle's answer, you can prepare the string for your current regex by replacing the curly quotes:
string.replace('“', '"').replace('”', '"')
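The same normalization can be done in one pass with str.translate. Below is a sketch for Python 3; QUOTE_MAP and extract_phrase are hypothetical names, and the pattern is the original one from the question:

```python
import re

# Map each curly double quote (U+201C, U+201D) to a straight one.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"'})

say_pattern = re.compile(r'Say "(.*)"')

def extract_phrase(sentence):
    """Return the quoted phrase, or None if the sentence does not match."""
    m = say_pattern.search(sentence.translate(QUOTE_MAP))
    return m.group(1) if m else None

print(extract_phrase('Say "Hello world!"'))
print(extract_phrase("Say \u201cHello world!\u201d"))
```

Normalizing the input keeps the regex itself unchanged, which may matter if the pattern is owned by the framework rather than by your code.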
I want to add these two modifiers: g (global) and i (case-insensitive).
Here is part of my script:
search_pattern = r"/somestring/gi"
pattern = re.compile(search_pattern)
queries = pattern.findall(content)
But it does not work.
The modifiers are as shown on this page: https://regex101.com/
First of all, I suggest that you should study regex101.com capabilities by checking all of its resources and sections. You can always see Python (and PHP and JS) code when clicking code generator link on the left.
How to obtain g global search behavior?
In Python, findall returns all matched texts when the pattern has no capturing groups. So, for r"somestring", findall is sufficient.
If the pattern has capturing groups, findall returns the groups instead; when you need the whole match, use finditer and access .group(0).
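A short sketch of that difference; the content string is made up for illustration:

```python
import re

content = "somestring, then somestring again"

# No capturing groups: findall returns the full matched texts.
whole = re.findall(r"somestring", content)

# One capturing group: findall returns only what the group captured.
groups_only = re.findall(r"(some)string", content)

# finditer yields match objects, so group(0) still gives the whole match.
whole_via_iter = [m.group(0) for m in re.finditer(r"(some)string", content)]

print(whole)
print(groups_only)
print(whole_via_iter)
```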
How to obtain case-insensitive behavior?
The i modifier is available as the re.I (or re.IGNORECASE) flag, as in pattern = re.compile(search_pattern, re.I). Also, you can use inline flags: pattern = re.compile("(?i)somestring").
A word on regex delimiters
Instead of search_pattern = r"/somestring/gi" you should use search_pattern = r"somestring". This is due to the fact that flags are passed as a separate argument, and all actions are implemented as separate re methods.
So, you can use
import re
p = re.compile(r'somestring', re.IGNORECASE)
test_str = "somestring"
re.findall(p, test_str)
Or
import re
p = re.compile(r'(?i)(some)string')
test_str = "somestring"
print([(x.group(0),x.group(1)) for x in p.finditer(test_str)])
When I started using Python, I wondered the same. Unfortunately, Python does not provide special delimiters for creating regexes. At the end of the day, a regex is just a string, so you cannot specify modifiers along with it, unlike in JavaScript or Ruby.
Instead, you need to compile the regex with the modifiers as flags.
regex = re.compile(r'something', re.IGNORECASE | re.DOTALL)
queries = regex.findall(content)
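If you have many patterns written in the /pattern/flags style, a small conversion helper is one option. The sketch below is an assumption-laden illustration, not something the re module provides: compile_delimited and FLAG_LETTERS are hypothetical names, and only the i, m and s letters are mapped.

```python
import re

# Assumed mapping from regex101-style flag letters to re flags.
FLAG_LETTERS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def compile_delimited(literal):
    """Compile a '/pattern/flags' literal; hypothetical helper, not part of re."""
    if not literal.startswith("/"):
        raise ValueError("expected a literal of the form /pattern/flags")
    body, sep, letters = literal[1:].rpartition("/")
    if not sep:
        raise ValueError("missing closing delimiter")
    flags = 0
    for letter in letters:
        if letter == "g":
            continue  # no flag needed: findall and finditer are already global
        flags |= FLAG_LETTERS[letter]  # raises KeyError for unmapped letters
    return re.compile(body, flags)

pattern = compile_delimited("/somestring/gi")
found = pattern.findall("SomeString and SOMESTRING")
print(found)
```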
search_pattern = r"somestring"
pattern = re.compile(search_pattern, flags=re.I)  # flags are passed as a separate argument
print(pattern.findall("somestriNg"))
You can set flags this way. findall is global by default. Also, you don't need delimiters in Python.
string = "RegisterParameter uri wub {"
RegisterName = re.findall("RegisterParameter uri ([^ ]*) {",string)
print 'RegisterName is :',RegisterName
See the above code. Here I want to find the register name in the string, i.e. wub, using a regular expression, and I have written the RE for that. If you run this code, it gives output like ['wub'], but I want only wub, without the brackets or quotes. What modification is needed here?
Many thanks for your help.
RegisterName is a list with just one str element. If the issue is just printing you could try:
print 'RegisterName is :', RegisterName[0]
Output:
RegisterName is : wub
PS:
When you are not sure of the type of a variable try printing it:
print type(RegisterName)
I would recommend following Python naming conventions: identifiers like SomeName are usually used for classes. For variables, you could use some_name or register_name.
You can use re.search() (or re.match() - depends on your needs) and get the capturing group:
>>> import re
>>> s = "RegisterParameter uri wub {"
>>> match = re.search("RegisterParameter uri ([^ ]*) {", s)
>>> match.group(1) if match else "Nothing found"
'wub'
Also, instead of [^ ]*, you may want to use \w*. \w matches any word character.
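The two character classes are not interchangeable, though. A small sketch of where they differ; the second input string, with a hyphenated name, is made up for illustration:

```python
import re

s = "RegisterParameter uri wub {"
hyphenated = "RegisterParameter uri wub-2 {"  # hypothetical input

non_space = re.compile(r"RegisterParameter uri ([^ ]*) {")
word_chars = re.compile(r"RegisterParameter uri (\w*) {")

# Both patterns agree on the original string.
print(non_space.search(s).group(1))
print(word_chars.search(s).group(1))

# [^ ]* accepts any non-space character, including "-".
print(non_space.search(hyphenated).group(1))

# \w* stops at "-", so the following " {" cannot match and there is no result.
print(word_chars.search(hyphenated))
```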
See also:
What is the difference between Python's re.search and re.match?
In regex, what does \w* mean?