How can I retain specific characters in a sentence - python

I want to remove certain words or characters from a sentence, with some exceptions, using a regular expression.
For example, I have the string "this is [/.] a string [ra] with [/] something". I want to remove [ra] and [/.] but not [/].
I used:
m = re.sub(r'\[.*?\]', '', n)
which works fine, but how can I retain [/]?

You may use
re.sub(r'\[(?!/])[^][]*]', '', n)
Details
\[ - a [ char
(?!/]) - a negative lookahead that fails the match if there is /] immediately to the right of the current location
[^][]* - 0+ chars other than [ and ]
] - a ] char.
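As a quick check, the pattern can be run against the sample sentence from the question. This is a minimal sketch; the names pattern, cleaned and tidy are made up for the example, and the whitespace clean-up at the end is an optional extra that the answer itself does not include.

```python
import re

# Sample sentence from the question.
n = "this is [/.] a string [ra] with [/] something"

# Remove every [...] group except the literal [/].
pattern = re.compile(r"\[(?!/])[^][]*]")
cleaned = pattern.sub("", n)
print(cleaned)  # the removed groups leave double spaces behind

# Optional extra (not part of the answer): collapse runs of whitespace.
tidy = " ".join(cleaned.split())
print(tidy)
```

The lookahead only rejects a bracket group that starts with /], so [/.] is still removed while [/] survives.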

Use the pattern \[(?!\/\])[^\]]+\] and replace all matches with an empty string.
Explanation: \[ matches a literal [. The negative lookahead (?!\/\]) then ensures that what follows is NOT /], so [/] is never matched. Finally, [^\]]+\] matches everything up to and including the closing ] ([^\]]+ matches one or more characters other than ]).

You could use an alternation to capture in a group what you want to keep and match what you want to remove.
result = re.sub(r"(\[/])|\[[^]]+\]", r"\1", n)
Explanation
(\[/])|\[[^]]+\]
(\[/]) Capture [/] in a group
| Or
\[[^]]+\] Match an opening square bracket until a closing square bracket using a negated character class
Replace with the first capturing group \1
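The same keep-or-drop idea can also be written with a replacement function instead of the \1 template. This is only a sketch under that assumption; keep_or_drop is a hypothetical helper name, not something from the answer. It makes the "keep group 1, otherwise delete" rule explicit and avoids relying on how an unmatched group is expanded in a replacement template.

```python
import re

n = "this is [/.] a string [ra] with [/] something"

def keep_or_drop(match):
    # Group 1 is set only when the token to keep, [/], was matched.
    # For every other bracket group it is None, so we return "".
    return match.group(1) or ""

result = re.sub(r"(\[/])|\[[^]]+\]", keep_or_drop, n)
print(result)
```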

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code I tried is below. On regexr, the regex matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat 1+ whitespace chars other than a newline, followed by 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', string)
This does not work, however, for the string defined in the OP's code, which ends with a question mark that appears in neither of the two examples given in the question: the question mark stays attached to the last element.
The regular expression can be broken down as follows.
(?<=\))   # positive lookbehind asserts that location in the
          # string is preceded by ')'
[ ]+      # match one or more spaces
|         # or
[ ]+      # match one or more spaces
(?=\()    # positive lookahead asserts that location in the
          # string is followed by '('
In the above I've put each of the two space characters in a character class merely to make them visible.
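To see the split in action on the two examples from the question, here is a minimal sketch; splitter, examples and parts are names chosen for the illustration.

```python
import re

# Split on spaces that directly follow ')' or directly precede '('.
splitter = re.compile(r"(?<=\)) +| +(?=\()")

examples = ["(so) what (are you trying to say)", "what (do you mean)"]
parts = [splitter.split(text) for text in examples]

for item in parts:
    print(item)
```

Spaces inside the parentheses are left alone because they are neither preceded by ")" nor followed by "(".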

Get words in parenthesis as a group regex

String1: {{word1|word2|word3 (word4 word5)|word6}}
String2: {{word1|word2|word3|word6}}
With this regex pattern:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?=\}\})
I capture the parts of String2 as groups. How can I change the pattern to capture (word4 word5) as a group as well?
You can add a (?:\s*(\([^()]*\)))? subpattern:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
The (?:\s*(\([^()]*\)))? part is an optional non-capturing group that matches one or zero occurrences of
\s* - zero or more whitespaces
( - start of a capturing group:
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
) - end of the group.
If you need to make sure only whitespace-separated words are allowed inside the parentheses, replace [^()]* with \w+(?:\s+\w+)*, so that the optional subpattern becomes (?:\s*(\(\w+(?:\s+\w+)*\)))?:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\(\w+(?:\s+\w+)*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
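A short sketch of how the optional group behaves on both input strings, using the first ([^()]*) variant of the pattern. The names rx, string1, string2, groups1 and groups2 are made up for the example. When the parenthesized part is absent, the corresponding group is simply None.

```python
import re

rx = re.compile(
    r"(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)"
    r"(?:\s*(\([^()]*\)))?"          # optional "(...)" part, captured as group 4
    r"\|(\w+(?:\s+\w+)*)(?=\}\})"
)

string1 = "{{word1|word2|word3 (word4 word5)|word6}}"
string2 = "{{word1|word2|word3|word6}}"

groups1 = rx.search(string1).groups()
groups2 = rx.search(string2).groups()

print(groups1)
print(groups2)  # group 4 is None here because the optional part is missing
```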
You could simplify the expression by matching the desired substrings rather than capturing them. For that you could use the following regular expression.
(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)
The elements of the expression are as follows.
(?<=      # begin a positive lookbehind
  [{| ]   # match one of the indicated characters
)         # end the positive lookbehind
\w+       # match one or more word characters
(?=       # begin a positive lookahead
  [}| ]   # match one of the indicated characters
)         # end the positive lookahead
|         # or
\(        # match a literal '('
[\w ]+    # match one or more of the indicated characters
\)        # match a literal ')'
Note that this does not validate the format of the string.
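Because this version matches the substrings instead of capturing them, re.findall returns them directly as a flat list. A minimal sketch, with token_rx, tokens1 and tokens2 as illustrative names:

```python
import re

# Either a word delimited by {, | or space on the left and }, | or space
# on the right, or a whole parenthesized run of words and spaces.
token_rx = re.compile(r"(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)")

tokens1 = token_rx.findall("{{word1|word2|word3 (word4 word5)|word6}}")
tokens2 = token_rx.findall("{{word1|word2|word3|word6}}")

print(tokens1)
print(tokens2)
```

The words inside the parentheses are not returned separately: word4 is preceded by "(" and word5 is followed by ")", and neither character is in the lookaround classes.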

regex pattern not matching continuous groups

I am trying the following pattern :
[,;\" ](.+?\/.+?)[\",; ]
in the following string:
['"text/html,application/xhtml+xml,application/xml;q=0.9;q
=0.8"']
I want to extract text/html, application/xhtml+xml and application/xml. It extracts the first and the third but not the middle one. Why?
Your last [,"; ] consumes the , after text/html and thus, at the next iteration, when the regex engine searches for a match, the first [,;" ] cannot match that comma. Hence, you lose one match.
You may turn the trailing [,"; ] into a non-consuming pattern, a positive lookahead, or better, since the matches cannot contain the delimiters, use a negated character class approach:
[,;" ]([^/,;" ]+/[^/,;" ]+)
If the expected matches may contain more than one /, remove the / char from the second character class.
Details
[,;" ] - a comma, ;, ", or space
([^/,;" ]+/[^/,;" ]+) - Group 1: one or more chars other than /, a comma, ;, " and space, then a /, and then again one or more chars other than /, a comma, ;, " and space, as many as possible
Python demo:
import re
rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""
res = re.findall(rx, s)
print(res) # => ['text/html', 'application/xhtml+xml', 'application/xml']
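The lookahead alternative mentioned above can be sketched in the same way. This is an illustration only: it keeps the lazy .+? parts of the original pattern and merely turns the trailing delimiter class into a positive lookahead, so the delimiter is checked but not consumed. The names lookahead_rx and matches are made up for the example.

```python
import re

# Same input string as in the demo above.
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

# The trailing delimiter is only asserted, so it stays available as the
# leading delimiter of the next match.
lookahead_rx = re.compile(r'[,;" ](.+?/.+?)(?=[",; ])')

matches = lookahead_rx.findall(s)
print(matches)
```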

python regex get text among two tag with new line

I'm new to regex. Here is my data.
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
I want to get this.
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Here is my regex.
(<p>\[tag(.*)\])(.+)(\[\/tag\]<\/p>)
But it doesn't work because of the newlines (\n). If I use re.DOTALL, it works, but if my data has multiple records like
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
re.findall() returns only one match. In short, I want this:
[data1,data2,data3...]. What can I do?
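Simple as this, for a single record (with several records, the text between one [/tag] and the next [tag], such as </p> and <p>, is captured as well):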
\](.*?)\[
reobj = re.compile(r"\](.*?)\[", re.IGNORECASE | re.DOTALL | re.MULTILINE)
result = reobj.findall(YOURSTRING)
Output:
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Regex Explanation:
\] matches the character ] literally
1st Capturing group (.*?)
.*? matches any character
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\[ matches the character [ literally
s modifier: single line. Dot matches newline characters
You can use this regex:
\[tag\]([\s\S]*?)\[\/tag\]
Match information:
MATCH 1
1. [8-44] `y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38`
Update: what the pattern does:
\[tag\]
([\s\S]*?) --> [\s\S]*? is used to match everything, since \S matches all
non-whitespace characters and \s matches whitespace. This is just a trick; you can
also use [\D\d] or [\W\w]. The *? is a non-greedy (lazy) quantifier.
\[\/tag\]
On the other hand, if you want to allow attributes in the tag you can use:
\[tag.*?\]([\s\S]*?)\[\/tag\]
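For the multi-record case from the question, a minimal sketch with re.findall could look like this. The names data, tag_rx, records and rows are chosen for the example, and the final split into rows is an optional extra step that the answer does not mention.

```python
import re

# Two records, as in the question.
data = (
    "<p>[tag]y,m,m,l\n1997,f,e,2.34g\n2000,m,c,2.38[/tag]</p>\n"
    "<p>[tag]y,m,m,l\n1997,f,e,2.34g\n2000,m,c,2.38[/tag]</p>"
)

# [\s\S]*? crosses newlines without re.DOTALL; the lazy quantifier stops
# at the first [/tag], so each record becomes its own match.
tag_rx = re.compile(r"\[tag.*?\]([\s\S]*?)\[/tag\]")
records = tag_rx.findall(data)
print(records)

# Optional extra: break each record into its lines.
rows = [record.splitlines() for record in records]
print(rows)
```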

Python regex matching all but last occurrence

I have an expression such as "./folder/thisisa.test/file.cxx.h". How do I substitute/remove all the "." except the last one?
To match all but the last dot with a regex:
'\.(?=[^.]*\.)'
This uses a lookahead to check that there's another dot after the one we found (the lookahead is not part of the match).
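Plugged into re.sub, the pattern removes every dot that still has another dot somewhere after it. A minimal sketch using the path from the question; the name result is chosen for the example.

```python
import re

s = "./folder/thisisa.test/file.cxx.h"

# A dot is matched only if, after skipping non-dot chars, another dot follows.
result = re.sub(r"\.(?=[^.]*\.)", "", s)
print(result)
```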
Without regular expressions, using str.count and str.replace:
s = "./folder/thisisa.test/file.cxx.h"
s.replace('.', '', s.count('.')-1)
# '/folder/thisisatest/filecxx.h'
Specific one-char solution
In your current scenario, you may use
text = re.sub(r'\.(?![^.]*$)', '', text)
Here, \.(?![^.]*$) matches a . (with \.) that is not immediately followed ((?!...)) with any 0+ chars other than . (see [^.]*) up to the end of the string ($).
Generic solution for 1+ chars
If you want to handle . and several other chars in the same way, you may use a capturing group around a character class with the chars you need to match, and add a positive lookahead with .* and a backreference to the captured value.
Say you need to remove all but the last occurrence of each of [, ], ^, \, /, - or .; then you may use
([][^\\./-])(?=.*\1)
Details
([][^\\./-]) - a capturing group matching ], [, ^, \, ., /, - (note the order of these chars is important: - must be at the end, ] must be at the start, ^ should not be at the start and \ must be escaped)
(?=.*\1) - a positive lookahead that requires any 0+ chars as many as possible and then the value captured in Group 1.
Python sample code:
import re
text = r"./[\folder]/this-is-a.test/fi^le.cxx.LAST[]^\/-.h"
text = re.sub(r'([][^\\./-])(?=.*\1)', '', text, flags=re.S)
print(text) # => folderthisisatestfilecxxLAST[]^\/-.h
Mind the r prefix on the string literals. Note that flags=re.S makes . match line break chars as well, so the lookahead can also find a later occurrence on another line.
