python regex issue with underscore - python

I am trying to do a string search with regular expressions. I need to print the [a-zA-Z_] characters only if they end with a space (" "), but I am having trouble: if I have an underscore at the end, the check doesn't wait for the space and executes the command anyway.
if re.search(r".*\s\D+\s", string):
    print(string)
If I set
string = "abc shot0000 "
it works fine. I need the command to run only when the string ends with a space (\s).
But if I set
string = "abc shot0000 _"
then it doesn't wait for the space (\s) and executes the command.

You're using search, and this function, as the name says, searches your string for any place where the pattern appears, which is the case in both of your strings.
You should add a $ to your regular expression to anchor the match at the end of the string:
if re.search(r".*\s\D+\s$", string):
    print(string)

You need to anchor the RE at the end of the string with $:
if re.search(r".*\s\D+\s$", string):
    print(string)

Use a $:
>>> strs = "abc shot0000 "
>>> re.search(r"\s\w+\s$", strs)  # use \w: it matches A-Za-z0-9_
<_sre.SRE_Match object at 0xa530100>
>>> strs = "abc shot0000 _"
>>> re.search(r"\s\w+\s$", strs)
#None
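To make the effect of the $ anchor concrete, here is a minimal sketch that wraps both anchored patterns from the answers above in small helper functions. The function names are made up for illustration. Note that \D excludes digits, so the \D+ variant only matches when the last token before the trailing space contains no digits, while the \w+ variant also accepts tokens such as shot0000.

```python
import re

def ends_with_word_and_space(text):
    """True when text ends with whitespace, word characters, then whitespace."""
    return re.search(r"\s\w+\s$", text) is not None

def ends_with_nondigits_and_space(text):
    """Same check, but with \\D+ (non-digit characters) as the last token."""
    return re.search(r".*\s\D+\s$", text) is not None

print(ends_with_word_and_space("abc shot0000 "))       # True
print(ends_with_word_and_space("abc shot0000 _"))      # False: no trailing space
print(ends_with_nondigits_and_space("abc shot0000 "))  # False: token contains digits
print(ends_with_nondigits_and_space("abc shot "))      # True
```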

Related

Add text into string between single quotes using regexp

I am trying to use Python to add some escape characters into a string when I print to the terminal.
import re
string1 = "I am a test string"
string2 = "I have some 'quoted text' to display."
string3 = "I have 'some quotes' plus some more text and 'some other quotes'."
pattern = ... # I do not know what kind of pattern to use here
I then want to add the console color escape (\033[92m for green and \033[0m to end the escape sequence) and end characters at the beginning and end of the quoted string using something like this:
result1 = re.sub(...)
result2 = re.sub(...)
result3 = re.sub(...)
with the end result looking like:
result1 = "I am a test string"
result2 = "I have some '\033[92mquoted text\033[0m' to display."
result3 = "I have '\033[92msome quotes\033[0m' plus some more text and '\033[92msome other quotes\033[0m'."
What kind of pattern should I use to do this, and is re.sub an appropriate method for this, or is there a better regex function?
You could use a capturing group with a negated character class to capture everything that is not a ' between two single quotes.
res = re.sub(r"'([^']*)'", r"'\033[92m\1\033[0m'", s)
Here \1 refers to the first capturing group.
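As a rough sketch of how this could be applied to the three strings from the question, the snippet below uses a replacement function instead of a replacement template, so the escape codes are ordinary Python strings. The names GREEN, RESET, QUOTED and highlight_quoted are assumptions made for this example.

```python
import re

GREEN = "\033[92m"  # console escape for green text (from the question)
RESET = "\033[0m"   # console escape that ends the color
QUOTED = re.compile(r"'([^']*)'")  # text between two single quotes

def highlight_quoted(text):
    """Wrap the contents of every single-quoted span in the color escapes."""
    return QUOTED.sub(lambda m: "'" + GREEN + m.group(1) + RESET + "'", text)

string1 = "I am a test string"
string2 = "I have some 'quoted text' to display."
string3 = "I have 'some quotes' plus some more text and 'some other quotes'."

for s in (string1, string2, string3):
    print(highlight_quoted(s))
```

A string without quotes, such as string1, is returned unchanged.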

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to replace anything that starts with a backslash, up to the next space, with an empty string. For example, \fs24 or \qc23424 needs to be replaced with an empty string. There could be multiple occurrences of such tags, all of which I want to remove. I have created a "tags to be eradicated" list which I aim to use in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this is not producing the desired result. Alternatively, I have built an eradicate list and replace its entries in a loop, from the top number down to the lower ones, but this is a performance killer.
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
and replace each match with nothing.
\s? - Optionally match a whitespace character (for a tag that is mid-string and therefore preceded by a space);
(?<!\S) - Assert that the position is not preceded by a non-whitespace character (this also allows a position at the start of your input);
\\ - A literal backslash;
[a-z\d]+ - One or more (greedy) characters from the given class.
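The pattern above can be tried out with a short sketch like the following. The names TAG_PATTERN and strip_tags are made up for illustration, and the sample is the input string from the question.

```python
import re

# Optional leading whitespace, then a backslash tag that is not glued
# to a preceding non-whitespace character.
TAG_PATTERN = re.compile(r"\s?(?<!\S)\\[a-z\d]+")

def strip_tags(text):
    """Remove every backslash tag (and one preceding whitespace character)."""
    return TAG_PATTERN.sub("", text)

sample = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."
print(strip_tags(sample))
# This is a string and it contains some texts and tags. which I want to remove from my string.

# A backslash directly after a non-whitespace character is left alone.
print(strip_tags(r"C:\temp"))
```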
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
Does that work for you?
re.sub(
    r"\\\w+\s*",   # a backslash followed by word characters and optional spacing;
    '',            # replace it with an empty string;
    input_string   # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
'\\' matches a literal '\' and '\w+' matches the word characters that follow, up to the next space or punctuation mark.
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
I tried this and it worked fine for me:
def remover(text, state):
    # Take the text after the first backslash, up to the next space
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    # Rebuild the full tag, including the backslash and the trailing space
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    # Keep looping while any backslash remains
    state = "\\" in text
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)
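If the "tags to be eradicated" list from the question has to be used, one option might be to compile the whole list into a single alternation, so the text is scanned once rather than once per tag. This is only a sketch: the list contents below are hypothetical (the real list is not shown in the question), and build_tag_pattern is a made-up helper name.

```python
import re

# Hypothetical list of tag names; the real list is not shown in the question.
TAGS_TO_ERADICATE = ["fs24", "qc23424", "fs2"]

def build_tag_pattern(tags):
    """Compile one pattern matching a backslash followed by any listed tag."""
    # Longest tags first, so "fs24" is tried before its prefix "fs2".
    ordered = sorted(tags, key=len, reverse=True)
    alternation = "|".join(re.escape(tag) for tag in ordered)
    # \s? also removes one whitespace character in front of the tag.
    return re.compile(r"\s?\\(?:" + alternation + ")")

TAG_PATTERN = build_tag_pattern(TAGS_TO_ERADICATE)

sample = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."
cleaned = TAG_PATTERN.sub("", sample)
print(cleaned)
```

Whether this is actually faster than the loop depends on the size of the list and the data, so it is worth measuring on real input.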

Python regex not working with special characters

SOLVED: it replaced the " symbols in the file with ' (in the data strings)
Do you know a way to only search for 1 or more words (not numbers) between [" and \n?
This works on regexr.com, but not in Python:
https://regexr.com/3tju7
(?<=\[\")(\D+)(?=\\n)
"S": ["Something\n13/8-2018 09:00 to 11:30
Python code:
re.search('(?<=[\")(\D+)(?=\n)', str(data))
I think \[, \" and \\n are the problem. I have tried to use a raw string in Python:
re.search('(?<=\[\")(\D+)(?=\\n)', '"S": ["Something\n13/8-201809:00 to 11:30').group()
This worked, but I have to use "data" because I have multiple strings, and it won't let me use .group() on that.
Error: AttributeError: 'NoneType' object has no attribute 'group'
Your problem is that the \n is being interpreted as a newline, instead of the literal characters \ and n. You can use a simpler regex, \["([\w\s]+)$, along with the MULTILINE flag, without modifying the data.
>>> import re
>>> data = '"S": ["Something\n13/8-201809:00 to 11:30'
>>> pattern = r'\["([\w\s]+)$'
>>> m = re.search(pattern, data, re.MULTILINE)
>>> m.group(1)
'Something'
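The difference between a real newline and the two literal characters \ and n can be shown with a small sketch. The variable names are assumptions for this example; the lookaround patterns follow the one in the question, with the lookahead adjusted for each kind of data.

```python
import re

# Data containing a real newline character.
data_newline = '"S": ["Something\n13/8-2018 09:00 to 11:30'
# Data containing a literal backslash followed by the letter n.
data_literal = r'"S": ["Something\n13/8-2018 09:00 to 11:30'

# In a raw pattern, \n is read by the regex engine as a newline,
# while \\n means a literal backslash followed by n.
newline_pattern = re.compile(r'(?<=\[")\D+(?=\n)')
literal_pattern = re.compile(r'(?<=\[")\D+?(?=\\n)')

for pattern in (newline_pattern, literal_pattern):
    for data in (data_newline, data_literal):
        match = pattern.search(data)
        # Check for None before calling .group() to avoid the AttributeError.
        print(match.group() if match else None)
```

Each pattern finds Something only in the data that matches its idea of \n, and returns None for the other.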
Try putting an r before the pattern string, which marks it as "raw". This stops Python from interpreting escape sequences before the pattern is passed to the function.
re.search(r'\search', string)
Or:
rgx = re.compile(r'pattern')
rgx.search(string)
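A quick way to see what the r prefix changes is to compare the same pattern written with and without it. This sketch uses \b (a word boundary in regex, but a backspace character in a normal Python string); the variable names are made up for the example.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"    # the backslashes reach the regex engine untouched

print(len(plain), len(raw))  # 5 7

text = "a cat sat"
print(re.search(plain, text))        # None: the text contains no backspaces
print(re.search(raw, text).group())  # cat
```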

Python RE - Regular expression for matching a printf-like format string with escaped quotation marks

I am writing a little C++ preprocessor in python, which should find printf-like format strings. What I need is a regular expression, which matches from the first to the second quotation mark, but ignoring all escaped quotation marks in between ('\"'). Here's an example:
foo(bar, "Value of \"s\" is: %s", "foobar");
I need a regex for:
"Value of \"s\" is: %s"
What I have so far is this:
(".*?")
But I haven't found a way to ignore the escaped quotation marks. I'm new to this. I would be very grateful, if someone could give me a solution/tip.
Thanks in advance!
You could try the regex below to match all the characters between the first " and the next " that is not preceded by a backslash:
\".*?[^\\]\"
>>> s = r'foo(bar, "Value of \"s\" is: %s", "foobar");'
>>> m = re.search(r'".*?[^\\]"', s)
>>> result = m.group(0)
>>> print(result)
"Value of \"s\" is: %s"
Explanation:
" Matches the first double quote.
.*? Matches any character zero or more times. The ? after * makes the match reluctant.
[^\\]" Matches up to a " (double quote) that is not preceded by a \ symbol.
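One limitation of the pattern above is that it needs at least one character between the quotes, so an empty string literal "" is not handled. A commonly used alternative, shown here only as a sketch and not as the answerer's method, consumes either a non-quote, non-backslash character or a backslash escape pair at each step. The name STRING_LITERAL is made up for this example.

```python
import re

# A quote, then any number of (ordinary character | backslash + any character),
# then the closing quote.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')

s = r'foo(bar, "Value of \"s\" is: %s", "foobar");'
print(STRING_LITERAL.search(s).group(0))
# "Value of \"s\" is: %s"
print(STRING_LITERAL.findall(s))

# An empty literal is matched on its own.
empty_case = 'foo("", "x")'
print(STRING_LITERAL.search(empty_case).group(0))        # ""
print(re.search(r'".*?[^\\]"', empty_case).group(0))     # "", "
```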

Easiest way to replace a substring

What would be the easiest way to replace a substring within a string when I don't know the exact substring I am looking for and only know the delimiting strings? For example, if I have the following:
mystr = 'wordone wordtwo "In quotes"."Another word"'
I basically want to delete the first quoted words (including the quotes) and the period (.) following so the resulting string is:
'wordone wordtwo "Another word"'
Basically I want to delete the first quoted words and the quotes and the following period.
You are looking for regular expressions here, using the re module:
import re
quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
result = quoted_plus_fullstop.sub('', mystr)
The pattern matches a literal quote, followed by 1 or more characters that are not quotes, followed by another quote and a full stop.
Demo:
>>> import re
>>> mystr = 'wordone wordtwo "In quotes"."Another word"'
>>> quoted_plus_fullstop = re.compile(r'"[^"]+"\.')
>>> quoted_plus_fullstop.sub('', mystr)
'wordone wordtwo "Another word"'
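Note that sub replaces every match by default. If only the first quoted part should be removed, as the question asks, the count argument of sub can limit the number of replacements. A small sketch, with a made-up input containing three quoted parts:

```python
import re

quoted_plus_fullstop = re.compile(r'"[^"]+"\.')

three_parts = 'one "first"."second"."third"'

# Default: every quoted part followed by a full stop is removed.
print(quoted_plus_fullstop.sub('', three_parts))           # one "third"
# count=1: only the first match is removed.
print(quoted_plus_fullstop.sub('', three_parts, count=1))  # one "second"."third"
```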
