surrounding a pattern in python string with brackets - python

I have strings that look like this (input => desired output):
oldString="this is my {{string-d}}" => "this is my {{(string-d)}}"
oldString2="this is my second{{ new_string-d }}" => "this is my second{{ (new_string-d) }}"
oldString2="this is my second new_string-d " => "this is my second (new_string-d) "
oldString2="this is my second new[123string]-d " => "this is my second (new[123string]-d) "
I want to wrap the word that ends in "-d" in brackets: the opening bracket goes before the word and the closing bracket right after "-d".
I wrote code that looks for "-d" in the string and partitions the string into three parts: the text before "-d", "-d" itself, and the text after it. Then I scan the part before "-d" backwards until I find whitespace or "{", stop there, and add the brackets. My code looks like this:
P.S. I read many files and try to modify the strings in them; the example above just demonstrates what I'm trying to do.
if ('-d') in oldString:
    p = oldString.partition('-d')
    v = p[p.index('-d')-1]
    beforeString=''
    for i in reversed(v):
        if i != ' ' or i != '{':
            beforeString=i+beforeString
    indexNew = v.index(i)
    outPutLine = v[:indexNew]+'('+v[indexNew:]
    newString = outPutLine + '-d' + ' )'
    print(newString)
The result of running the code is:
newString = "(this is my {{string-d )"
As you can see, the opening bracket is placed before "this" instead of before "string". Why is this happening? Also, I'm not sure this is the best way to do this kind of find and replace, so any suggestions would be much appreciated.

>>> import re
>>> oldString = "this is my {{string-d}}"
>>> oldString2 = "this is my second{{ new_string-d }}"
>>> re.sub(r"(\w*-d)", r"(\1)", oldString)
'this is my {{(string-d)}}'
>>> re.sub(r"(\w*-d)", r"(\1)", oldString2)
'this is my second{{ (new_string-d) }}'
Note that this matches "words" assuming that a word is composed of only letters, numbers, and underscores.
Here's a more thorough breakdown of what's happening:
An r before a string literal means the string is a "raw string". It prevents Python from interpreting backslash sequences as escape sequences. For instance, r"\n" is a backslash followed by the letter n, rather than a single newline character. I like to use raw strings for my regex patterns, even though it's not always necessary.
The parentheses surrounding \w*-d form a capturing group. It indicates to the regex engine that the contents of the group should be saved for later use.
The sequence \w means "any alphanumeric character or underscore".
* means "zero or more of the preceding item". \w* together means "zero or more alphanumeric characters or underscores".
-d means "a hyphen followed by the letter d".
All together, (\w*-d) means "zero or more alphanumeric characters or underscores, followed by a hyphen and the letter d. Save all of these characters for later use."
The second string describes what the matched data should be replaced with. "\1" means "the contents of the first captured group". The parentheses are just regular parentheses. All together, (\1) in this context means "take the saved content from the captured group, surround it in parentheses, and put it back into the string".
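The pieces above can be checked in isolation. This is only a minimal sketch; the string "first-d and second-d" is made up for illustration.

```python
import re

# A raw string keeps the backslash: "\n" is one character, r"\n" is two.
print(len("\n"), len(r"\n"))

# group(1) is the text saved by the capturing group.
m = re.search(r"(\w*-d)", "this is my {{string-d}}")
print(m.group(1))

# \1 in the replacement inserts that saved text; sub() replaces every match.
result = re.sub(r"(\w*-d)", r"(\1)", "first-d and second-d")
print(result)
```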
If you want to match more characters than just alphanumeric and underscore, you can replace \w with whatever collection of characters you want to match.
>>> re.sub(r"([\w\.\[\]]*-d)", r"(\1)", "{{startingHere[zero1].my_string-d }}")
'{{(startingHere[zero1].my_string-d) }}'
If you also want to match words ending with "-d()", you can match a pair of parentheses with \(\) and mark it as optional using ?.
>>> re.sub(r"([\w\.\[\]]*-d(\(\))?)", r"(\1)", "{{startingHere[zero1].my_string-d() }}")
'{{(startingHere[zero1].my_string-d()) }}'
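As for why the loop in the question puts the bracket before "this": the test `i != ' ' or i != '{'` is true for every character (no character equals both), and the loop never stops early, so `i` ends up as the first character of the string and `v.index(i)` is 0. Below is a minimal sketch of the partition approach with that fixed, assuming only the first "-d" in a line matters. The function name `bracket_before_d` is made up here.

```python
def bracket_before_d(line):
    """Wrap the token that ends in '-d' in parentheses (first occurrence only)."""
    before, sep, after = line.partition('-d')
    if not sep:  # no '-d' in the line: nothing to do
        return line
    # Walk backwards until whitespace or '{' marks the start of the token.
    start = len(before)
    while start > 0 and before[start - 1] not in ' {':
        start -= 1
    return before[:start] + '(' + before[start:] + sep + ')' + after

print(bracket_before_d("this is my {{string-d}}"))
print(bracket_before_d("this is my second new[123string]-d "))
```

Unlike the \w-based regex, this treats anything other than a space or "{" as part of the word, which is why `new[123string]-d` is wrapped as a whole.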

If you want the bracketing to only take place inside double curly braces, you need something like this:
re.sub(r'({{\s*)([^}]*-d)(\s*}})', r'\1(\2)\3', s)
Breaking that down a bit:
# the target pattern
r'({{\s*)([^}]*-d)(\s*}})'
# ^^^^^^^                  capture group 1, opening {{ plus optional space
#        ^^^^^^^^^         capture group 2, non-braces plus -d
#                 ^^^^^^^  capture group 3, spaces plus closing }}
The replacement r'\1(\2)\3' just reassembles the groups, with parentheses around the middle one.
Putting it together:
import re
def quote_string_d(s):
    return re.sub(r'({{\s*)([^}]*-d)(\s*}})', r'\1(\2)\3', s)
print(quote_string_d("this is my {{string-d}}"))
print(quote_string_d("this is my second{{ new_string-d }}"))
print(quote_string_d("this should not be quoted other_string-d "))
Output:
this is my {{(string-d)}}
this is my second{{ (new_string-d) }}
this should not be quoted other_string-d
Note the third instance does not get the parentheses, because it's not inside {{ }}.
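Since the question mentions processing many files, here is a rough sketch of applying the same substitution line by line. The helper name `bracket_lines` and the file name in the usage note are hypothetical; the pattern is the one from this answer, compiled once.

```python
import re

# Pattern from the answer above, compiled once and reused for every line.
D_IN_BRACES = re.compile(r'({{\s*)([^}]*-d)(\s*}})')

def bracket_lines(lines):
    """Yield each line with {{ ...-d }} rewritten as {{ (...-d) }}."""
    for line in lines:
        yield D_IN_BRACES.sub(r'\1(\2)\3', line)

sample = ["this is my {{string-d}}\n", "no braces here-d\n"]
converted = list(bracket_lines(sample))
print(converted)
```

An open file object can be passed in place of the list, for example `with open("input.txt") as f: new_lines = list(bracket_lines(f))`, where input.txt is a placeholder.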

Related

Replacing everything with a backslash till next white space

As part of preprocessing my data, I want to replace everything from a backslash up to the next space with an empty string. For example, \fs24 or \qc23424 should be replaced with an empty string. There can be multiple occurrences of such backslash tags, and I want to remove all of them. I have created a "tags to be eradicated" list which I aim to use in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this does not produce the desired result. Alternatively, I have built a list of tags to eradicate and replace each of them in a loop, but this is a performance killer.
Assuming a tag can also occur at the very beginning of your string, and to avoid false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace it with nothing.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
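A short sketch of this pattern in use, as an illustration only. The second test string (with `C:\temp`) is invented here to show the false-positive guard.

```python
import re

TAG = re.compile(r'\s?(?<!\S)\\[a-z\d]+')

text = r"This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."
cleaned = TAG.sub('', text)
print(cleaned)

# A backslash in the middle of a token is not preceded by whitespace,
# so the look-behind rejects it and the text is left alone.
untouched = TAG.sub('', r"keep C:\temp as it is")
print(untouched)
```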
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
'\\' matches a literal '\' and '\w+' matches the word characters that follow it
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
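Removing only the tag leaves the whitespace that preceded it, so the result has a doubled space and a space before the period. One possible tweak, assuming the whitespace in front of each tag should go as well, is to let the pattern consume it:

```python
import re

s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""

# \s* swallows the whitespace before each tag;
# strip() tidies up if a tag was the very first thing in the string.
cleaned = re.sub(r'\s*\\\w+', '', s).strip()
print(cleaned)
```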
I tried this and it worked fine for me:
def remover(text, state):
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    state = True if "\\" in text else False
    return text, state
text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)

Regex expression with special chars and text

Given :
str_var ='host="dsa.asd.dsc"port="1234"service_nameORdbName="dsa"pass="dsa"user="ewq"'
How can I match, for example in host's case, a string such as abc.dfg.ewq.asd? The data can contain only '.' as a special character.
The expression I have can only match plain text because of \w+.
result = re.findall('(\w+)="(\w+)"', str_var)
Expected result:
[('host', 'dsa.asd.dsc'), ('port', '1234'), ('service_nameORdbName', 'dsa'), ('pass', 'dsa'), ('user', 'ewq')]
You may either add a . to the character class:
result = re.findall(r'(\w+)="([\w.]+)"', str_var)
Or match dot-delimited words with \w+(?:\.\w+)* (one or more word chars followed by zero or more repetitions of a dot and one or more word chars):
result = re.findall(r'(\w+)="(\w+(?:\.\w+)*)"', str_var)
Or match values between double quotes that may contain anything but a double quote (with "[^"]+", which matches a ", then one or more chars other than a double quote, and then a "):
result = re.findall(r'(\w+)="([^"]+)"', str_var)
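Because findall returns (key, value) tuples here, the result can be fed straight into dict() if a lookup table is more convenient. A small sketch using the string from the question; the variable name `settings` is made up for illustration:

```python
import re

str_var = 'host="dsa.asd.dsc"port="1234"service_nameORdbName="dsa"pass="dsa"user="ewq"'

pairs = re.findall(r'(\w+)="([^"]+)"', str_var)
print(pairs)

# Each tuple becomes one key/value entry.
settings = dict(pairs)
print(settings["host"])
```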

Regex find content in between single quotes, but only if contains certain word

I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I only have a pattern that matches the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return only the items containing the word 'sample_2'.
If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, the desired word can change. In that case, I would recommend building the pattern string dynamically and passing it to re.compile().
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find the first apostrophe and start the capture group. Capture anything except whitespace or an apostrophe, then require the literal text "sample_2", then accept any further non-whitespace characters. Stop capturing at the closing apostrophe followed by a comma.
Note: In Python, a string literal prefixed with r is a raw string: backslashes in it are not treated as escape characters, so you do not need to double them. The prefix does not compile anything; it is simply convenient for writing regular expressions.
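When the word comes from a variable, it may contain characters that are special in a regex (a dot, for instance). A hedged sketch that escapes the word with re.escape before building the pattern. The function name `quoted_items_containing` is hypothetical, and this variant ends the match at the closing apostrophe instead of requiring a trailing comma:

```python
import re

def quoted_items_containing(word, text):
    """Return single-quoted, whitespace-free items in text that contain word."""
    # re.escape makes sure the word is matched literally.
    pattern = re.compile(r"'([^\s']*" + re.escape(word) + r"[^\s']*)'")
    return pattern.findall(text)

teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
found = quoted_items_containing('sample_2', teststr)
print(found)
```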

Regular expression print only the first character

Instead of printing the whole "final" string, it prints only "p". Can anyone help here?
import re
final = r'print "\n^^^###***===TP test result: $final_verdict===***###^^^\n";'
searchObj = re.compile(r'[\w\s\"\n\^\^\^\#\#\#\*\*\*\=\=\=\w+\s\w+\:\s\$\w+\=\=\=\#\#\#\*\*\*\^\^\^\n\"\;]')
print(searchObj)
y = searchObj.match(final)
if y:
    print("Found", y.group())
else:
    print("Nothing")
Result:
re.compile('[\\w\\s\\"\\n\\^\\^\\^\\#\\#\\#\\*\\*\\*\\=\\=\\=\\w+\\s\\w+\\:\\s\\$\\w+\\=\\=\\=\\#\\#\\#\\*\\*\\*\\^\\^\\^\\n\\"\\;]')
Found p
You've put square brackets around your regex, which means you defined a character class. You should remove them:
r'\w\s\"\n\^\^\^\#\#\#\*\*\*\=\=\=\w+\s\w+\:\s\$\w+\=\=\=\#\#\#\*\*\*\^\^\^\n\"\;'
With a character class you say: any one of the characters between the square brackets. So [ab] means a or b, not a followed by b.
Now, however, your string does not match anymore (it is of course harder to match a sequence than a single character). You can improve the pattern to:
r'\w\s\"\\n\^\^\^###\*\*\*===\w+\s\w+\s\w+:\s\$\w+===\*\*\*###\^\^\^\\n\";'
#       ^        ^^^      ^^^        ^^^^^        ^^^      ^^^      ^
The carets on the second line show the changes. First of all, you do not need to escape # and =. Furthermore, you specified \n, which the regex engine treats as a newline character, but you want to match \n literally (two characters), so you need to escape the backslash: \\n. Finally, you forgot that there are three words before the colon (:).
Note that match() only matches at the start of the string and this pattern begins with a single \w, so either call searchObj.search(final) or start the pattern with \w+ so that it consumes the whole word print.
You can test and modify your regex on regex101.com.
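The difference between a character class and a sequence, and between match() and search(), can be seen in a few lines. This is a sketch only; the variable names `seq` and `whole` and the \w+ variant are assumptions added here, not part of the original answer.

```python
import re

final = r'print "\n^^^###***===TP test result: $final_verdict===***###^^^\n";'

# A character class matches exactly one character from the set.
print(re.match(r'[ab]', 'ab').group())

# The sequence pattern from the answer, unchanged.
seq = re.compile(r'\w\s\"\\n\^\^\^###\*\*\*===\w+\s\w+\s\w+:\s\$\w+===\*\*\*###\^\^\^\\n\";')
print(seq.match(final))           # match() anchors at "p", which is followed by "r", not whitespace
print(seq.search(final).group())  # search() finds the pattern starting at the "t" of print

# Assumed variant: \w+ lets match() consume the whole word "print".
whole = re.compile(r'\w+\s\"\\n\^\^\^###\*\*\*===\w+\s\w+\s\w+:\s\$\w+===\*\*\*###\^\^\^\\n\";')
print(whole.match(final).group() == final)
```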

Python/Regex splitting a specifically formatted return string

I'm working with a search&replace programming assignment. I'm a student and I'm finding the regex documentation a bit overwhelming (e.g. https://docs.python.org/2/library/re.html), so I'm hoping someone here could explain to me how to accomplish what I'm looking for.
I've used regex to get a list of strings from my document. They all look like this:
%#import fileName (regexStatement)
An actual example:
%#import script_example.py ( *out =(.|\n)*?return out)
Now, I'm wondering how I can split these up so I get the fileName and regexStatement as separate strings. I'd assume using a regex or a string split function, but I'm not sure how to make it work on all kinds of variations of %#import fileName (regexStatement). Splitting on parentheses could hit the middle of the regex statement, or fail if a parenthesis is part of the fileName, for instance. The assignment doesn't specify whether it should only be able to import from Python files, so I don't believe I can use ".py (" as a splitting point before the regex statement either.
I'm thinking of something like a regex "%#import " to hit the gap after import and "\..* " to hit the gap after fileName. But I'm not sure how to get rid of the parentheses that enclose the regex statement, or how to use all of it to actually split the string correctly so I have one variable storing fileName and one storing regexStatement for each entry in my list.
Thanks a lot for your attention!
If the filename can't contain spaces, just split your string on spaces with maxsplit 2:
>>> line.split(' ', 2)
['%#import', 'script_example.py', '( *out =(.|\n)*?return out)']
The maxsplit 2 makes it split only the first two spaces, and leave intact any spaces within the regex. Now you have the filename as the second element and the regex as the third. It's not clear from your statement whether the parentheses are part of the regex or not (i.e., as a capturing group). If not, you can easily remove them by trimming the first and last characters from that part.
If you assign the values like this:
filename, regex = line.split(' ', 2)[1:]
then you can strip the parentheses with:
regex = regex[1:-1]
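Put together, the split approach might look like the sketch below. The function name `parse_import_line` is invented for illustration, and it assumes, as above, that the file name contains no spaces and that the outer parentheses are not part of the regex.

```python
def parse_import_line(line):
    """Split '%#import fileName (regex)' into (fileName, regex)."""
    # maxsplit=2: only the first two spaces are split points.
    _, filename, regex = line.split(' ', 2)
    return filename, regex[1:-1]  # drop the enclosing parentheses

line = r"%#import script_example.py ( *out =(.|\n)*?return out)"
filename, regex = parse_import_line(line)
print(filename)
print(repr(regex))
```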
That should do it nicely
^%#import (\S+) \((.*)\)
or, if the filename may have spaces:
^%#import ((?:(?! \().)+) \((.*)\)
Both expressions contain two groups, one for the file name and one for the contents of the parentheses. Run in multiline mode on the entire file or in normal mode if you work with single lines anyway.
This: ((?:(?! \().)+) breaks down as:
(          # group start
  (?:      # non-capturing group
    (?!    # negative look-ahead: a position NOT followed by
       \(  # " ("
    )      # end look-ahead
    .      # match any char (this is part of the filename)
  )+       # end non-capturing group, repeat
)          # end group
The other bits of the expression should be self-explanatory.
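A quick check of the second expression against a file name that contains both a space and parentheses. The file name `my script(1).py` is made up for this sketch.

```python
import re

IMPORT_LINE = re.compile(r'^%#import ((?:(?! \().)+) \((.*)\)')

m = IMPORT_LINE.match(r"%#import my script(1).py ( *out =(.|\n)*?return out)")
if m:
    print(repr(m.group(1)))  # the file name, spaces included
    print(repr(m.group(2)))  # the contents of the outer parentheses
```

The file name ends at the first " (" (space followed by an opening parenthesis), so a "(" that directly follows another character, as in `script(1)`, stays part of the name.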
import re
line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
pattern = r'^%#import (\S+) \((.*)\)'
match = re.match(pattern, line)
if match:
    print("match.group(1) '" + match.group(1) + "'")
    print("match.group(2) '" + match.group(2) + "'")
else:
    print("No match.")
prints
match.group(1) 'script_example.py'
match.group(2) ' *out =(.|\n)*?return out'
To match something like %#import script_example.py ( *out =(.|\n)*?return out), I suggest:
r'%#impor[\w\W ]+'
Note that:
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character [^a-zA-Z0-9_]
Together, [\w\W ]+ matches any run of characters, newlines included, so a match extends from %#impor to the end of the string. You can use re.findall() to find the matches:
import re
re.findall(r'%#impor[\w\W ]+', your_string)
