I am trying to use Python to add some escape characters into a string when I print to the terminal.
import re
string1 = "I am a test string"
string2 = "I have some 'quoted text' to display."
string3 = "I have 'some quotes' plus some more text and 'some other quotes'."
pattern = ... # I do not know what kind of pattern to use here
I then want to add the console color escape (\033[92m for green and \033[0m to end the escape sequence) and end characters at the beginning and end of the quoted string using something like this:
result1 = re.sub(...)
result2 = re.sub(...)
result3 = re.sub(...)
with the end result looking like:
result1 = "I am a test string"
result2 = "I have some '\033[92mquoted text\033[0m' to display."
result3 = "I have '\033[92msome quotes\033[0m' plus some more text and '\033[92msome other quotes\033[0m'."
What kind of pattern should I use to do this, and is re.sub an appropriate method for this, or is there a better regex function?
You could use a capturing group around a negated character class, [^']*, to capture everything between a pair of single quotes.
res = re.sub(r"'([^']*)'", r"'\033[92m\1\033[0m'", s)
See this demo at regex101 or a Python demo at tio.run (\1 refers to the first group).
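As a rough sketch of the same idea applied to the three strings from the question: the helper name colorize_quoted and the GREEN/RESET constants are made up for illustration, and a function is used as the replacement so the escape codes are inserted without any template processing.

```python
import re

GREEN = "\033[92m"  # ANSI escape: switch to bright green
RESET = "\033[0m"   # ANSI escape: reset colors

def colorize_quoted(text):
    """Wrap the contents of every '...' pair in green color codes."""
    # [^']* matches everything up to the next single quote.
    return re.sub(r"'([^']*)'",
                  lambda m: f"'{GREEN}{m.group(1)}{RESET}'",
                  text)

string1 = "I am a test string"
string2 = "I have some 'quoted text' to display."
string3 = "I have 'some quotes' plus some more text and 'some other quotes'."

for s in (string1, string2, string3):
    print(colorize_quoted(s))
```

A string without quotes, like string1, comes back unchanged, so the same call can be applied to every line.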
I'd like to use a variable inside a regex, how can I do this in Python?
TEXTO = sys.argv[1]
if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
# Successful match
else:
# Match attempt failed
You have to build the regex as a string:
TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"
if re.search(my_regex, subject, re.IGNORECASE):
etc.
Note the use of re.escape so that if your text has special characters, they won't be interpreted as such.
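A minimal sketch of that approach wrapped in a function. The name contains_word and the sample inputs are hypothetical; only the pattern itself comes from the answer above.

```python
import re

def contains_word(needle, subject):
    """Return True if `needle` occurs in `subject` as a whole word (case-insensitive)."""
    # re.escape makes sure characters such as '.' in needle are matched literally.
    pattern = r"\b(?=\w)" + re.escape(needle) + r"\b(?!\w)"
    return re.search(pattern, subject, re.IGNORECASE) is not None

print(contains_word("cat", "The Cat sat"))   # True
print(contains_word("cat", "concatenate"))   # False
print(contains_word("c.t", "The cat sat"))   # False: the dot is literal
```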
From python 3.6 on you can also use Literal String Interpolation, "f-strings". In your particular case the solution would be:
if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
...do something
EDIT:
Since there have been some questions in the comment on how to deal with special characters I'd like to extend my answer:
raw strings ('r'):
One of the main concepts you have to understand when dealing with special characters in regular expressions is to distinguish between string literals and the regular expression itself. It is very well explained here:
In short:
Let's say instead of finding a word boundary \b after TEXTO you want to match the string \boundary. Then you have to write:
TEXTO = "Var"
subject = r"Var\boundary"
if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")
This only works because we are using a raw string (the regex is preceded by 'r'); otherwise we would have to write "\\\\boundary" in the regex (four backslashes). Additionally, without the 'r' prefix, '\b' would no longer be interpreted as a word boundary but as a backspace character!
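The difference can be checked directly. This is only an illustrative sketch; the variable names and the sample text are made up.

```python
import re

# The same two characters, written with and without the r prefix.
plain = "\b"   # one character: backspace (0x08)
raw = r"\b"    # two characters: backslash + 'b', which re reads as a word boundary

print(len(plain), len(raw))  # 1 2

raw_hit = re.search(r"\bVar\b", "a Var here")   # word-boundary match
plain_hit = re.search("\bVar\b", "a Var here")  # looks for literal backspace characters

print(raw_hit is not None, plain_hit is not None)  # True False

# A literal backslash needs two backslashes in a raw string, four in a plain one.
subject = r"Var\boundary"
two = re.search(r"Var\\boundary", subject)
four = re.search("Var\\\\boundary", subject)
print(two is not None, four is not None)  # True True
```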
re.escape:
Basically puts a backslash in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
NOTE: For any version >= Python 3.7: !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are not escaped. Only special characters with meaning in a regex are still escaped. _ has not been escaped since Python 3.3 (see here).
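A small sketch of what re.escape does to a string containing regex metacharacters. The sample values are invented for illustration.

```python
import re

escaped = re.escape("1.5+2")
print(escaped)  # 1\.5\+2

user_text = "a.c"  # hypothetical user input containing a metacharacter

# Unescaped, the dot matches any character, so "abc" is a hit.
unescaped_hit = re.search(user_text, "abc")
# Escaped, only a literal "a.c" would match.
escaped_hit = re.search(re.escape(user_text), "abc")

print(unescaped_hit is not None, escaped_hit is not None)  # True False
```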
Curly braces:
If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let's say you want to match TEXTO followed by exactly 2 digits:
if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")
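To see what the doubled braces turn into, it can help to print the finished pattern. In this sketch the value of TEXTO and the sample subjects are assumptions.

```python
import re

TEXTO = "item"  # hypothetical search term

# {{2}} in the f-string becomes the literal quantifier {2} in the pattern.
pattern = rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)"
print(pattern)  # \b(?=\w)item\d{2}\b(?!\w)

print(bool(re.search(pattern, "see ITEM42 now", re.IGNORECASE)))   # True
print(bool(re.search(pattern, "see item423 now", re.IGNORECASE)))  # False: three digits
```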
if re.search(r"\b(?<=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):
This will insert what is in TEXTO into the regex as a string.
rx = r'\b(?<=\w){0}\b(?!\w)'.format(TEXTO)
I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.
import re
string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)
Output:
[('begin', 'id1'), ('middl', 'id2')]
I agree with all the above unless:
sys.argv[1] was something like Chicken\d{2}-\d{2}An\s*important\s*anchor
sys.argv[1] = "Chicken\d{2}-\d{2}An\s*important\s*anchor"
you would not want to use re.escape, because in that case you would like it to behave like a regex
TEXTO = sys.argv[1]
if re.search(r"\b(?<=\w)" + TEXTO + r"\b(?!\w)", subject, re.IGNORECASE):
# Successful match
else:
# Match attempt failed
You can also try the format method as syntactic sugar:
re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)
I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my re search term:
pattern = re.compile(r"("+username+".*):(.*?):(.*?):(.*?):(.*)")
matches = re.findall(pattern, lines)
Output can be printed using the following:
print(matches[0]) # prints one whole matching line (in this case, the first line)
print(matches[0][3]) # prints the fourth capture group (established with the parentheses in the regex statement) of the first line.
from re import search, IGNORECASE
def is_string_match(word1, word2):
    # Case-insensitive function that checks if two words are the same
    # word1: string
    # word2: string | list
    # if word2 is a list, check word1 against each word in it
    if isinstance(word2, list):
        for word in word2:
            if search(rf'\b{word1}\b', word, IGNORECASE):
                return True
        return False
    # otherwise check whether word1 is the same as word2
    if search(rf'\b{word1}\b', word2, IGNORECASE):
        return True
    return False
is_match_word = is_string_match("Hello", "hELLO")
True
is_match_word = is_string_match("Hello", ["Bye", "hELLO", "#vagavela"])
True
is_match_word = is_string_match("Hello", "Bye")
False
here's another format you can use (tested on python 3.7)
regex_str = r'\b(?<=\w)%s\b(?!\w)'%TEXTO
I find it's useful when you can't use {} for variable (here replaced with %s)
You can use the format method for this as well. It replaces the {} placeholder with the variable you pass to it as an argument.
if re.search(r"\b(?=\w){}\b(?!\w)".format(TEXTO), subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed
more example
I have configus.yml
with flows files
"pattern":
- _(\d{14})_
"datetime_string":
- "%m%d%Y%H%M%f"
in python code I use
data_time_real_file=re.findall(r""+flows[flow]["pattern"][0]+"", latest_file)
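A self-contained sketch of that lookup, with the parsed YAML replaced by a plain dict. The flow name "daily" and the file name are made up for illustration; only the pattern and the datetime format come from the config above.

```python
import re

# Hypothetical stand-in for the dict obtained by loading the YAML file.
flows = {
    "daily": {
        "pattern": [r"_(\d{14})_"],
        "datetime_string": ["%m%d%Y%H%M%f"],
    }
}

flow = "daily"
latest_file = "export_01312024153045_final.csv"  # made-up file name

# The pattern read from the config is already a string, so it can be compiled directly.
pattern = re.compile(flows[flow]["pattern"][0])
stamps = pattern.findall(latest_file)
print(stamps)  # ['01312024153045']
```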
As part of preprocessing my data, I want to be able to replace anything that comes with a slash till the occurrence of space with empty string. For example, \fs24 need to be replaced with empty or \qc23424 with empty. There could be multiple occurrences of tags with slashes which I want to remove. I have created a "tags to be eradicated" list which I aim to consume in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this is not fetching the desired result. Alternately, I have built an eradicate list and replacing that from a loop from top to lower number but this is a performance killer.
Assuming a 'tag' can also occur at the very beginning of your string, and avoid selecting false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing. See an online demo.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
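Applied in Python, the pattern described above could look like the following sketch. The names TAG_RE and strip_tags are hypothetical.

```python
import re

TAG_RE = re.compile(r"\s?(?<!\S)\\[a-z\d]+")

def strip_tags(text):
    """Remove backslash tags such as \\fs24 together with one preceding space."""
    return TAG_RE.sub("", text)

sample = (r"This is a string \fs24 and it contains some texts and tags "
          r"\qc23424. which I want to remove from my string.")
print(strip_tags(sample))
# This is a string and it contains some texts and tags. which I want to remove from my string.

# A backslash glued to preceding text is left alone by the lookbehind.
print(strip_tags(r"C:\temp"))  # C:\temp
```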
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
'\\' matches '\' and '\w+' matches word characters up to the next space
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
I tried this and it worked fine for me:
def remover(text, state):
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    state = "\\" in text
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)
I have a sentence in which every token has a / in it. I want to just print what I have before the slash.
What I have now is basic:
text = less/RBR.....
return re.findall(r'\b(\S+)\b', text)
This obviously just prints the text, how do I cut off the words before the /?
Assuming you want all characters before the slash out of every word that contains a slash. This would mean e.g. for the input string match/this but nothing here but another/one you would want the results match and another.
With regex:
import re
result = re.findall(r"\b(\w*?)/\w*?\b", my_string)
print(result)
Without regex:
result = [word.split("/")[0] for word in my_string.split()]
print(result)
Simple and straight-forward:
rx = r'^[^/]+'
# anchor it to the beginning
# the class says: match everything not a forward slash as many times as possible
In Python this would be:
import re
text = "less/RBR....."
print(re.match(r'[^/]+', text))
Since this returns a match object, you'd probably want to print the matched text, like so:
print(re.match(r'[^/]+', text).group(0))
# less
This should also work
\b([^\s/]+)(?=/)\b
Python Code
p = re.compile(r'\b([^\s/]+)(?=/)\b')
test_str = "less/RBR/...."
print(re.findall(p, test_str))
I am trying to do some string searching with regular expressions, where I need to print the string only if the characters [a-z,A-Z,_] end with a " " space. However, I am having some trouble: if I have an underscore at the end, it doesn't wait for the space and executes the command anyway.
if re.search(r".*\s\D+\s", string):
print string
if i keep
string = "abc shot0000 "
it works fine, i do need it to execute it only when the string ends with a space \s.
but if i keep
string = "abc shot0000 _"
then it doesn't wait for the space \s and executes the command.
You're using search, and this function, as the name says, searches your string for any place where the pattern appears, which is the case in both of your strings.
You should add a $ to your regular expression to search for the end of string:
if re.search(r".*\s\D+\s$", string):
    print(string)
You need to anchor the RE at the end of the string with $:
if re.search(r".*\s\D+\s$", string):
    print(string)
Use a $:
>>> strs = "abc shot0000 "
>>> re.search(r"\s\w+\s$", strs) #use \w: it'll handle A-Za-z_
<_sre.SRE_Match object at 0xa530100>
>>> strs = "abc shot0000 _"
>>> re.search(r"\s\w+\s$", strs)
#None
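To make the effect of the anchor visible, here is a small sketch that runs the \w-based pattern with and without $ on both strings from the question. The function name is made up. Note that $ also matches just before a trailing newline.

```python
import re

def ends_with_word_and_space(s):
    """True only if s ends with whitespace, word characters, whitespace."""
    return re.search(r"\s\w+\s$", s) is not None

for s in ("abc shot0000 ", "abc shot0000 _"):
    unanchored = re.search(r"\s\w+\s", s) is not None  # may match anywhere
    print(repr(s), unanchored, ends_with_word_and_space(s))
# 'abc shot0000 ' True True
# 'abc shot0000 _' True False
```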
In the following script I would like to pull out text between the double quotes ("). However, the python interpreter is not happy and I can't figure out why...
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.match(pattern, text)
print m.group()
The output should be find.me-_/\.
match starts searching from the beginning of the text.
Use search instead:
#!/usr/bin/env python
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.search(pattern, text)
print(m.group(1))  # group(1) is the text between the quotes
match and search return None when they fail to match.
I guess you are getting AttributeError: 'NoneType' object has no attribute 'group' from python: This is because you are assuming you will match without checking the return from re.match.
If you write:
m = re.search(pattern, text)
match: searches at the beginning of text
search: searches all the string
Maybe this helps you to understand:
http://docs.python.org/library/re.html#matching-vs-searching
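Putting both points together (use search, and check for None before calling group), a minimal sketch might look like this. The helper name first_quoted is hypothetical; the pattern is the one from the question.

```python
import re

QUOTED = re.compile(r'"([A-Za-z0-9_\./\\-]*)"')

def first_quoted(text):
    """Return the text inside the first pair of double quotes, or None."""
    m = QUOTED.search(text)  # search scans the whole string
    if m is None:            # no match: avoid AttributeError on .group
        return None
    return m.group(1)        # group(1) excludes the surrounding quotes

text = 'Hello, "find.me-_/\\" please help with python regex'
print(first_quoted(text))         # find.me-_/\
print(first_quoted("no quotes"))  # None
print(QUOTED.match(text))         # None: match only tries position 0
```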
Split the text on quotes and take every other element starting with the second element:
def text_between_quotes(text):
    return text.split('"')[1::2]
my_string = 'Hello, "find.me-_/\\" please help and "this quote" here'
my_string.split('"')[1::2] # ['find.me-_/\\', 'this quote']
'"just one quote"'.split('"')[1::2] # ['just one quote']
This assumes you don't have quotes within quotes, and your text doesn't mix quotes or use other quoting characters like `.
You should validate your input. For example, what do you want to do if there's an odd number of quotes, meaning not all the quotes are balanced? You could discard the last item if the split produces an even number of pieces:
def text_between_quotes(text):
    split_text = text.split('"')
    between_quotes = split_text[1::2]
    # discard the last element if the quotes are unbalanced
    if len(split_text) % 2 == 0 and between_quotes and not text.endswith('"'):
        between_quotes.pop()
    return between_quotes
# ['first quote', 'second quote']
text_between_quotes('"first quote" and "second quote" and "unclosed quote')
or raise an error instead.
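The error-raising variant could be sketched as follows. The function name and the choice of ValueError are assumptions, not part of the answer above.

```python
def text_between_quotes_strict(text):
    """Return the quoted pieces of text, or raise if a quote is left unclosed."""
    # An odd number of double quotes means at least one is unbalanced.
    if text.count('"') % 2 != 0:
        raise ValueError("unbalanced double quotes in input")
    return text.split('"')[1::2]

print(text_between_quotes_strict('"first quote" and "second quote"'))
# ['first quote', 'second quote']

try:
    text_between_quotes_strict('"first quote" and "unclosed quote')
except ValueError as err:
    print("rejected:", err)
```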
Use re.search() instead of re.match(). The latter will match only at the beginning of strings (like an implicit ^).
You need re.search(), not re.match() which is anchored to the start of your input string.
Docs here