Python regex match text between quotes - python

In the following script I would like to pull out text between the double quotes ("). However, the python interpreter is not happy and I can't figure out why...
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.match(pattern, text)
print m.group()
The output should be find.me-_/\.

match starts searching from the beginning of the text.
Use search instead:
#!/usr/bin/env python
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.search(pattern, text)
print(m.group(1))  # group(1) is the text inside the quotes
match and search return None when they fail to match.
I guess you are getting AttributeError: 'NoneType' object has no attribute 'group' from python: This is because you are assuming you will match without checking the return from re.match.
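A minimal sketch of the None check described above, reusing the question's text and pattern. The helper name `first_quoted` is only illustrative and not part of the original code.

```python
import re

text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'


def first_quoted(pattern, text):
    """Return the first captured group, or None if nothing matches."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(1)


# re.match only tries position 0, where the text starts with 'H', so it fails.
print(re.match(pattern, text))      # None
print(first_quoted(pattern, text))  # find.me-_/\
```

Checking for None before calling group() avoids the AttributeError entirely.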

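If you write the following instead, the pattern is found:
m = re.search(pattern, text)
The difference between the two:
match: matches only at the beginning of the string
search: scans the whole string for the first match
The documentation explains this in more detail:
http://docs.python.org/library/re.html#matching-vs-searching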

Split the text on quotes and take every other element starting with the second element:
def text_between_quotes(text):
    return text.split('"')[1::2]
my_string = 'Hello, "find.me-_/\\" please help and "this quote" here'
my_string.split('"')[1::2] # ['find.me-_/\\', 'this quote']
'"just one quote"'.split('"')[1::2] # ['just one quote']
This assumes you don't have quotes within quotes, and your text doesn't mix quotes or use other quoting characters like `.
You should validate your input. For example, what do you want to do if there's an odd number of quotes, meaning not all the quotes are balanced? You could discard the last item when the split produces an even number of pieces:
def text_between_quotes(text):
    split_text = text.split('"')
    between_quotes = split_text[1::2]
    # discard the last element if the quotes are unbalanced
    if len(split_text) % 2 == 0 and between_quotes and not text.endswith('"'):
        between_quotes.pop()
    return between_quotes
# ['first quote', 'second quote']
text_between_quotes('"first quote" and "second quote" and "unclosed quote')
or raise an error instead.
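A sketch of the "raise an error" variant, assuming that any odd number of double quotes should be rejected. The function name `text_between_quotes_strict` and the error message are made up for illustration.

```python
def text_between_quotes_strict(text):
    """Return the quoted segments, raising ValueError on an unclosed quote."""
    # An odd number of quote characters means one quote was never closed.
    if text.count('"') % 2 != 0:
        raise ValueError('unbalanced double quotes in input')
    return text.split('"')[1::2]


print(text_between_quotes_strict('"first quote" and "second quote"'))
# ['first quote', 'second quote']
```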

Use re.search() instead of re.match(). The latter will match only at the beginning of strings (like an implicit ^).

You need re.search(), not re.match(), which is anchored to the start of your input string.

Related

Is it possible to construct a pattern for python regex with a variable? [duplicate]

I'd like to use a variable inside a regex, how can I do this in Python?
TEXTO = sys.argv[1]
if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed
You have to build the regex as a string:
TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"
if re.search(my_regex, subject, re.IGNORECASE):
etc.
Note the use of re.escape so that if your text has special characters, they won't be interpreted as such.
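To illustrate why re.escape matters here, a small sketch with a made-up subject string and a made-up value for TEXTO that contains a regex metacharacter:

```python
import re

subject = "price is abc today"  # hypothetical text to search
TEXTO = "a.c"                   # hypothetical user input containing '.'

unescaped = r"\b(?=\w)" + TEXTO + r"\b(?!\w)"
escaped = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"

# Without escaping, '.' matches any character, so "abc" is found.
print(bool(re.search(unescaped, subject, re.IGNORECASE)))  # True
# With escaping, only the literal text "a.c" can match.
print(bool(re.search(escaped, subject, re.IGNORECASE)))    # False
print(bool(re.search(escaped, "price is a.c today", re.IGNORECASE)))  # True
```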
From python 3.6 on you can also use Literal String Interpolation, "f-strings". In your particular case the solution would be:
if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...  # do something
EDIT:
Since there have been some questions in the comment on how to deal with special characters I'd like to extend my answer:
raw strings ('r'):
One of the main concepts you have to understand when dealing with special characters in regular expressions is the distinction between string literals and the regular expression itself. In short:
Let's say instead of finding a word boundary \b after TEXTO you want to match the string \boundary. Then you have to write:
TEXTO = "Var"
subject = r"Var\boundary"
if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")
This only works because we are using a raw string (the regex is preceded by 'r'); otherwise we would have to write "\\\\boundary" in the regex (four backslashes). Additionally, without the 'r' prefix, '\b' would not be interpreted as a word boundary but as a backspace!
re.escape:
Basically puts a backslash in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
NOTE: For any version >= Python 3.7: !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are not escaped. Only special characters with meaning in a regex are still escaped. _ is not escaped since Python 3.3.
Curly braces:
If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let's say you want to match TEXTO followed by exactly 2 digits:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
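A quick way to confirm what the doubled braces produce is to print the finished pattern. This sketch assumes a made-up value for TEXTO and made-up subject strings:

```python
import re

TEXTO = "Var"  # hypothetical search term

# {{2}} in the f-string becomes the literal quantifier {2} in the pattern.
pattern = rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)"
print(pattern)  # \b(?=\w)Var\d{2}\b(?!\w)

print(bool(re.search(pattern, "use var12 here", re.IGNORECASE)))   # True
print(bool(re.search(pattern, "use var123 here", re.IGNORECASE)))  # False
```

The second search fails because three digits follow the term, so there is no word boundary after the first two.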
if re.search(r"\b(?=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):
This will insert what is in TEXTO into the regex as a string.
rx = r'\b(?=\w){0}\b(?!\w)'.format(TEXTO)
I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.
import re
string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)
Output:
[('begin', 'id1'), ('middl', 'id2')]
I agree with all the above unless sys.argv[1] is itself a regex, something like Chicken\d{2}-\d{2}An\s*important\s*anchor:
sys.argv[1] = r"Chicken\d{2}-\d{2}An\s*important\s*anchor"
In that case you would not want to use re.escape, because you want the text to behave like a regex:
TEXTO = sys.argv[1]
if re.search(r"\b(?=\w)" + TEXTO + r"\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed
You can also build the pattern with str.format:
re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)
I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my re search term:
pattern = re.compile(r"(" + username + ".*):(.*?):(.*?):(.*?):(.*)")
matches = re.findall(pattern, lines)
Output can be printed using the following:
print(matches[0]) # prints one whole matching line (in this case, the first line)
print(matches[0][3]) # prints the fourth capture group (established with the parentheses in the regex) of the first line.
from re import search, IGNORECASE
def is_string_match(word1, word2):
    # Case-insensitive check of whether word1 matches word2
    # word1: string
    # word2: string | list
    # if word2 is a list, check whether word1 matches any word in it
    if isinstance(word2, list):
        for word in word2:
            if search(rf'\b{word1}\b', word, IGNORECASE):
                return True
        return False
    # otherwise check whether word1 is the same as word2
    if search(rf'\b{word1}\b', word2, IGNORECASE):
        return True
    return False
is_match_word = is_string_match("Hello", "hELLO")  # True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])  # True
is_match_word = is_string_match("Hello", "Bye")  # False
Here's another format you can use (tested on Python 3.7):
regex_str = r'\b(?=\w)%s\b(?!\w)' % TEXTO
I find it useful when you can't use {} for the variable (here replaced with %s).
You can use the format method as well. It replaces the {} placeholder with the variable you pass to it as an argument.
if re.search(r"\b(?=\w){}\b(?!\w)".format(TEXTO), subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed
Another example: I have configus.yml with flow entries like this:
"pattern":
- _(\d{14})_
"datetime_string":
- "%m%d%Y%H%M%f"
In Python code I use:
data_time_real_file = re.findall(flows[flow]["pattern"][0], latest_file)

Having a problem with Python Regex: Prints "None" when printing "matches". Regex works in tester

I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see two issues:
You're not using a raw string (prefix the string with an r), so Python processes the backslash escapes before the regex engine sees them. For example, each \n becomes a literal newline, which re.VERBOSE then ignores.
With re.VERBOSE, unescaped whitespace in the pattern is ignored, so the literal spaces in parts such as <span class= never match; write them as \s instead. Also, match() only matches at the start of the string, so use finditer() (or search()) to find the list items wherever they occur:
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
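As a rough illustration of why the corrected pattern writes spaces as \s and iterates with finditer rather than calling match, here is a sketch on a made-up one-line snippet (not the real Yelp markup):

```python
import re

html = '<li><span class="rank">7</span></li>'  # made-up stand-in for the page

# Under re.VERBOSE the unescaped space in "span class" is dropped from the pattern.
plain_space = re.compile(r'<span class="rank">(\d+)', re.VERBOSE)
# Writing the space as \s keeps it significant.
escaped_space = re.compile(r'<span\sclass="rank">(\d+)', re.VERBOSE)

print(plain_space.search(html))             # None
print(escaped_space.match(html))            # None: match() only tries position 0
print(escaped_space.search(html).group(1))  # 7
```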

Extracting the last statement in []'s (regex) [duplicate]

I'm trying to extract the last statement in brackets. However my code is returning every statement in brackets plus everything in between.
Ex: 'What [are] you [doing]'
I want '[doing]', but I get back '[are] you [doing]' when I run re.search.
I ran re.search using a regex expression that SHOULD get the last statement in brackets (plus the brackets) and nothing else. I also tried adding \s+ at the beginning hoping that would fix it, but it didn't.
string = '[What] are you [doing]'
m = re.search(r'\[.*?\]$' , string)
print(m.group(0))
I should just get [doing] back, but instead I get the entire string.
re.findall(r'\[(.+?)\]', 'What [are] you [doing]')[-1]
'doing'
To extract only the last statement in brackets:
import re
s = 'What [are] you [doing]'
m = re.search(r'.*(\[[^\[\]]+\])', s)
res = m.group(1) if m else m
print(res) # [doing]
You can use findall and take the last element:
import re
string = 'What [are] you [doing]'
re.findall(r"\[\w{1,}]", string)[-1]
Output
'[doing]'
This also works with the example posted by @MonkeyZeus in the comments: if the last pair of brackets is empty, it is skipped rather than returned. For example
string = 'What [are] you []'
Output
'[are]'
You can use a negative lookahead pattern to ensure that there isn't another pair of brackets to follow the matching pair of brackets:
re.search(r'\[[^\]]*\](?!.*\[.*\])', string).group()
or you can use .* to consume all the leading characters until the last possible match:
re.search(r'.*(\[.*?\])', string).group(1)
Given string = 'abc [foo] xyz [bar] 123', both of the above would return: '[bar]'
This captures bracketed segments with anything in between the brackets (not necessarily letters or digits: any symbols/spaces/etc):
import re
string = '[US 1?] Evaluate any matters identified when testing segment information.[US 2!]'
print(re.findall(r'\[[^]]*\]', string)[-1])
gives
[US 2!]
A minor fix with your regex. You don't need the $ at the end. And also use re.findall rather than re.search
import re
string = 'What [are] you [doing]'
re.findall(r"\[.*?\]", string)[-1]
Output:
'[doing]'
If you have empty [] in your string, it will also be counted in the output by above method. To solve this, change the regex from \[.*?\] to \[..*?\]
import re
string = "What [are] you []"
re.findall(r"\[..*?\]", string)[-1]
Output:
'[are]'
If nothing matches, findall returns an empty list and [-1] raises an IndexError, as in the other findall-based answers, so you will have to use try and except (or check the list first).
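One way to avoid the exception is to check the list before indexing. The sketch below borrows the \[[^\[\]]+\] pattern from an earlier answer; the helper name `last_bracketed` is only illustrative.

```python
import re


def last_bracketed(text):
    """Return the last non-empty [...] segment, or None if there is none."""
    matches = re.findall(r'\[[^\[\]]+\]', text)
    return matches[-1] if matches else None


print(last_bracketed('What [are] you [doing]'))  # [doing]
print(last_bracketed('What [are] you []'))       # [are]
print(last_bracketed('no brackets here'))        # None
```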

Regex negative lookahead ignoring comments

I'm having problems with this regex. I want to pull out just MATCH3, because the others, MATCH1 and MATCH2 are commented out.
# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment
The regex I have captures all of the MATCH's.
(?<=url\(r'\^)(.*?)(?=\$',)
How do I ignore lines beginning with a comment? With a negative lookahead? Note the # character is not necessarily at the start of the line.
EDIT: sorry, all answers are good! the example forgot a comma after the $' at the end of the match group.
You really don't need to use lookarounds here; you could look for possible leading whitespace and then match "url" and the context that follows it, capturing the part you want to retain.
>>> import re
>>> s = """# url(r'^MATCH1/$',),
#url(r'^MATCH2$',),
url(r'^MATCH3$',), # comment"""
>>> re.findall(r"(?m)^\s*url\(r'\^([^$]+)", s)
['MATCH3']
^\s*#.*$|(?<=url\(r'\^)(.*?)(?=\$'\))
Try this and grab the capture group (commented lines match the first alternative and leave the group empty). See demo.
https://www.regex101.com/r/rK5lU1/37
import re
p = re.compile(r'^\s*#.*$|(?<=url\(r\'\^)(.*?)(?=\$\'\))', re.IGNORECASE | re.MULTILINE)
test_str = "# url(r'^MATCH1/$'),\n #url(r'^MATCH2$'),\n url(r'^MATCH3$') # comment"
re.findall(p, test_str)
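With this alternation, findall returns an empty string for every commented line, so the captures still need filtering. A possible sketch using finditer, where the group is None whenever the comment alternative matched:

```python
import re

p = re.compile(r"^\s*#.*$|(?<=url\(r'\^)(.*?)(?=\$'\))", re.MULTILINE)
test_str = "# url(r'^MATCH1/$'),\n #url(r'^MATCH2$'),\n url(r'^MATCH3$') # comment"

# Keep only the matches where the capture group actually participated.
captures = [m.group(1) for m in p.finditer(test_str) if m.group(1) is not None]
print(captures)  # ['MATCH3']
```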
If this is the only place where you need to match, then match beginning of line followed by optional whitespace followed by url:
(?m)^\s*url\(r'(.*?)'\)
If you need to cover more complicated cases, I'd suggest using ast.parse instead, as it truly understands the Python source code parsing rules.
import ast
tree = ast.parse("""(
# url(r'^MATCH1/$'),
#url(r'^MATCH2$'),
url(r'^MATCH3$') # comment
)""")
class UrlCallVisitor(ast.NodeVisitor):
    def visit_Call(self, node):
        if getattr(node.func, 'id', None) == 'url':
            # ast.Constant replaces the deprecated ast.Str (Python 3.8+)
            if (node.args and isinstance(node.args[0], ast.Constant)
                    and isinstance(node.args[0].value, str)):
                print(node.args[0].value.strip('$^'))
        self.generic_visit(node)
UrlCallVisitor().visit(tree)
This prints the first string literal argument of each call to a function named url; in this case, it prints MATCH3. Notice that the source for ast.parse needs to be well-formed Python source code (thus the parentheses; otherwise a SyntaxError is raised).
As an alternative, you can split each line on '#'; if the first element contains 'url' (that is, the url call is not commented out), you can use re.search to match the substring that you want:
>>> [re.search(r"url\(r'\^(.*?)\$'" ,i[0]).group(1) for i in [line.split('#') for line in s.split('\n')] if 'url' in i[0]]
['MATCH3']
Also note that you don't need to use look-arounds for your pattern; you can just use grouping!

Python Regular Expression Matching: ## ##

I'm searching a file line by line for the occurrence of ##random_string##. It works except for the case of multiple #...
pattern='##(.*?)##'
prog=re.compile(pattern)
string='lala ###hey## there'
result=prog.search(string)
print re.sub(result.group(1), 'FOUND', string)
Desired Output:
"lala #FOUND there"
Instead I get the following because it's grabbing the whole ###hey##:
"lala FOUND there"
So how would I ignore any number of # at the beginning or end, and only capture "##string##".
To match at least two hashes at either end:
pattern='##+(.*?)##+'
Your problem is with your inner match. You use ., which matches any character that isn't a line end, and that means it matches # as well. So when it gets ###hey##, it matches (.*?) to #hey.
The easy solution is to exclude the # character from the matchable set:
prog = re.compile(r'##([^#]*)##')
Protip: Use raw strings (e.g. r'') for regular expressions so you don't have to go crazy with backslash escapes.
Trying to allow # inside the hashes will make things much more complicated.
EDIT: If you do not want to allow blank inner text (i.e. "####" shouldn't match with an inner text of ""), then change it to:
prog = re.compile(r'##([^#]+)##')
+ means "one or more."
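Applied to the question's example string, the [^#]+ version can be checked with a short sketch like this:

```python
import re

prog = re.compile(r'##([^#]+)##')
string = 'lala ###hey## there'

m = prog.search(string)
print(m.group(1))                 # hey
# Substituting on the compiled pattern replaces only the ##hey## part.
print(prog.sub('FOUND', string))  # lala #FOUND there
```

The extra leading # is left in place because the match can only start where exactly two hashes are followed by a non-hash character.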
'^#{2,}([^#]*)#{2,}' -- any number of # >= 2 on either end
be careful with using lazy quantifiers like (.*?) because it'd match '##abc#####' and capture 'abc###'. also lazy quantifiers are very slow
Try the "block comment trick": /##((?:[^#]|#[^#])+?)##/
Add + to the regex, which means match one or more of the preceding character:
pattern='#+(.*?)#+'
prog=re.compile(pattern)
string='###HEY##'
result=prog.search(string)
print(result.group(1))
Output:
HEY
Have you considered doing it a non-regex way?
>>> string='lala ####hey## there'
>>> string.split("####")[1].split("#")[0]
'hey'
>>> import re
>>> text= 'lala ###hey## there'
>>> matcher= re.compile(r"##[^#]+##")
>>> print(matcher.sub("FOUND", text))
lala #FOUND there
