I'm having problems with this regex. I want to pull out just MATCH3, because the others, MATCH1 and MATCH2 are commented out.
# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment
The regex I have captures all of the MATCH's.
(?<=url\(r'\^)(.*?)(?=\$',)
How do I ignore lines beginning with a comment? With a negative lookahead? Note the # character is not necessarily at the start of the line.
EDIT: sorry, all answers are good! the example forgot a comma after the $' at the end of the match group.
You really don't need to use lookarounds here, you could look for possible leading whitespace and then match "url" and the preceding context; capturing the part you want to retain.
>>> import re
>>> s = """# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment"""
>>> re.findall(r"(?m)^\s*url\(r'\^([^$]+)", s)
['MATCH3']
^\s*#.*$|(?<=url\(r'\^)(.*?)(?=\$'\))
Try this.Grab the capture.See demo.
https://www.regex101.com/r/rK5lU1/37
import re
p = re.compile(r'^\s*#.*$|(?<=url\(r\'\^)(.*?)(?=\$\'\))', re.IGNORECASE | re.MULTILINE)
test_str = "# url(r'^MATCH1/$'),\n #url(r'^MATCH2$'),\n url(r'^MATCH3$') # comment"
re.findall(p, test_str)
If this is the only place where you need to match, then match beginning of line followed by optional whitespace followed by url:
(?m)^\s*url\(r'(.*?)'\)
If you need to cover more complicated cases, I'd suggest using ast.parse instead, as it truly understands the Python source code parsing rules.
import ast
tree = ast.parse("""(
# url(r'^MATCH1/$'),
#url(r'^MATCH2$'),
url(r'^MATCH3$') # comment
)""")
class UrlCallVisitor(ast.NodeVisitor):
def visit_Call(self, node):
if getattr(node.func, 'id', None) == 'url':
if node.args and isinstance(node.args[0], ast.Str):
print(node.args[0].s.strip('$^'))
self.generic_visit(node)
UrlCallVisitor().visit(tree)
prints each first string literal argument given to function named url; in this case, it prints MATCH3. Notice that the source for ast.parse needs to be a well-formed Python source code (thus the parenthesis, otherwise a SyntaxError is raised).
As an alternative you can split your lines with '#' if the first element has 'url' in (it doesn't start with # ) you can use re.search to match the sub-string that you want :
>>> [re.search(r"url\(r'\^(.*?)\$'" ,i[0]).group(1) for i in [line.split('#') for line in s.split('\n')] if 'url' in i[0]]
['MATCH3']
Also note that you dont need to sue look-around for your pattern you can just use grouping!
Related
Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all fields starting with "/ and '/ with regex. I've tried many url regexes but it didn't work for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
Consider:
import re
txt = ... #your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
print(el[1])
each el will be match of pattern starting with single, or double quote. Then minimal number of characters, then the same character as at the beginning (same type of quote).
.* stands for whatever number of any characters, following ? makes it non-greedy i.e. provides minimal characters match. Then \1 refers to first group, so it will match whatever type of quote was matched at the beginning. Then by specifying el[1] we return second group matched i.e. whatever was matched within quotes.
I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))
How about this ? Example illusrated using a file:
f = open('abc.log','r')
content = f.readlines()
for line in content:
m = re.search(r"\[(.*?)\]", line)
print m.group(1)
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)
I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see two issues:
You're not using a raw string (prefix the string with an r), which means that your backslashes are going to be trying to represent special things instead of being part of the string.
I believe your multiline string is going to be attempting to match both the newlines between each line and the spaces at the start of the string into your regex (which you don't want, given this is not how the regex is formatted in your link).
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
I'm trying to split a multiline string on a character but only if the line does not contain :. Unfortunately I can't see an easy way to use re.split() with negative lookback on the character : since it's possible that : occurred in another line earlier in the string.
As an example, I'd like to split the below string on ).
String:
Hello1 (
First : (),
Second )
Hello2 (
First
)
Output:
['Hello1 (\nFirst : (),\nSecond', 'Hello2 (\nFirst \n']
It is possible with Python, albeit not "out of the box" with the native re module.
First alternative
The newer regex module supports a variable-length lookbehind, so you could use
(?<=^[^:]+)\)
# pos. lookbehind making sure there's no : in that line
In Python:
import regex as re
data = """
Hello1 (
First : (),
Second )
Hello2 (
First
)"""
pattern = re.compile(r'(?<=^[^:]+)\)', re.MULTILINE)
parts = pattern.split(data)
print(parts)
Which yields
['\nHello1 (\nFirst : (),\nSecond ', '\n\nHello2 (\nFirst \n', '']
Second alternative
Alternatively, you could match the lines in question and let them fail with (*SKIP)(*FAIL) afterwards:
^[^:\n]*:.*(*SKIP)(*FAIL)|\)
# match lines with at least one : in it
# let them fail
# or match )
Again in Python:
pattern2 = re.compile(r'^[^:\n]*:.*(*SKIP)(*FAIL)|\)', re.MULTILINE)
parts2 = pattern.split(data)
print(parts2)
See a demo for the latter on regex101.com.
Third alternative
Ok, now the answer is getting longer than previously thought. You can even do it with the native re module with the help of a function. Here, you need to substitute the ) in question first and split by the substitute:
def replacer(match):
if match.group(1) is not None:
return "SUPERMAN"
else:
return match.group(0)
pattern3 = re.compile(r'^[^:\n]*:.*|(\))', re.MULTILINE)
data = pattern3.sub(replacer, data)
parts3 = data.split("SUPERMAN")
print(parts3)
I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for similar post, but couldnt find it :(
You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
^ ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.
re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more any characters besides $; this can potentially avoid more pathological backtracking behaviour.
Your regex is fine. re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in m = re.findall(r'$(.*?)$', line):
if m is not None:
print m.group(0)