How can I retain specific characters in a sentence - python

I want to remove certain words or characters from a sentence, with some exceptions, using a regular expression.
For example, I have the string this is [/.] a string [ra] with [/] something, and I want to remove [ra] and [/.] but not [/].
I used:
m = re.sub(r'\[.*?\]', '', n)
which works fine, but how can I retain [/]?

You may use
re.sub(r'\[(?!/])[^][]*]', '', n)
See the regex demo.
Details
\[ - a [ char
(?!/]) - a negative lookahead that fails the match if there is /] immediately to the right of the current location
[^][]* - 0+ chars other than [ and ]
] - a ] char.
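As a quick check, the lookahead pattern above can be run against the sample string from the question. This is only a minimal sketch; the variable name cleaned is chosen here for illustration.

```python
import re

# Sample string from the question
n = "this is [/.] a string [ra] with [/] something"

# Remove every [...] token except the literal [/]
cleaned = re.sub(r'\[(?!/])[^][]*]', '', n)
print(cleaned)  # this is  a string  with [/] something
```

Note that the spaces around the removed tokens stay in place, so the result contains double spaces; collapsing them would need an extra step.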

Use the pattern \[(?!\/\])[^\]]+\] and replace all matches with an empty string.
Explanation: \[ matches [, then the negative lookahead (?!\/\]) ensures that what follows is NOT /], so [/] is not matched. Finally, [^\]]+\] matches everything up to ] and the ] itself ([^\]]+ matches one or more characters other than ]).

You could use an alternation to capture in a group what you want to keep and match what you want to remove.
result = re.sub(r"(\[/])|\[[^]]+\]", r"\1", n)
Explanation
(\[/])|\[[^]]+\]
(\[/]) Capture [/] in a group
| Or
\[[^]]+\] Match an opening square bracket until a closing square bracket using a negated character class
Replace with the first capturing group \1
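The capture-and-keep idea described above can be tried on the question's sample string. This is a sketch that assumes Python 3.5 or later, where a group that did not take part in the match expands to an empty string in the replacement.

```python
import re

n = "this is [/.] a string [ra] with [/] something"

# Group 1 takes part in the match only when the token is exactly [/].
# For any other bracketed token the group is unmatched, so \1 expands
# to an empty string and the token is dropped.
result = re.sub(r"(\[/])|\[[^]]+\]", r"\1", n)
print(result)  # this is  a string  with [/] something
```

The same approach extends to several exceptions by adding more alternatives inside the capturing group.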

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code I tried is below. On the regexr site, the expression matches the parts I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))

Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']

For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', s)
where s is the string to split. This does not work, however, for the string defined in the OP's code, which contains a question mark, unlike the two examples given in the question.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of the two space characters in a character class merely to make it visible.
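The lookaround-based split broken down above can be checked against the two example strings from the question. This is a minimal sketch; the names splitter, examples and parts are chosen here for illustration.

```python
import re

# Split on spaces that follow ')' or that precede '('
splitter = re.compile(r'(?<=\)) +| +(?=\()')

examples = ["(so) what (are you trying to say)", "what (do you mean)"]
parts = [splitter.split(text) for text in examples]

for p in parts:
    print(p)
# ['(so)', 'what', '(are you trying to say)']
# ['what', '(do you mean)']
```

Spaces inside the parentheses are left alone because they are neither preceded by ')' nor followed by '('.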

Get words in parentheses as a group regex

String1: {{word1|word2|word3 (word4 word5)|word6}}
String2: {{word1|word2|word3|word6}}
With this regex:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?=\}\})
I capture the parts of String2 as groups. How can I change the regex to also capture (word4 word5) as a group?

You can add a (?:\s*(\([^()]*\)))? subpattern:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
See the regex demo.
The (?:\s*(\([^()]*\)))? part is an optional non-capturing group that matches one or zero occurrences of
\s* - zero or more whitespaces
( - start of a capturing group:
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
) - end of the group.
If you need to make sure that only whitespace-separated words are allowed inside the parentheses, replace [^()]* with \w+(?:\s+\w+)*, which turns the optional group into (?:\s*(\(\w+(?:\s+\w+)*\)))?:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\(\w+(?:\s+\w+)*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
See this regex demo.
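To see how the optional group behaves, the first pattern above can be applied to both strings from the question. This is a sketch only; the names pattern, strings and matches are chosen here for illustration.

```python
import re

pattern = re.compile(
    r'(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)'
    r'(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})'
)

strings = [
    "{{word1|word2|word3 (word4 word5)|word6}}",
    "{{word1|word2|word3|word6}}",
]

# Group 4 holds the parenthesized part, or None when it is absent
matches = [pattern.search(s).groups() for s in strings]
for groups in matches:
    print(groups)
# ('word1', 'word2', 'word3', '(word4 word5)', 'word6')
# ('word1', 'word2', 'word3', None, 'word6')
```

Because the parenthesized part is optional, code that reads group 4 should be prepared to receive None.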

You could simplify the expression by matching the desired substrings rather than capturing them. For that you could use the following regular expression.
(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)
The elements of the expression are as follows.
(?<= # begin a positive lookbehind
[{| ] # match one of the indicated characters
) # end the positive lookbehind
\w+ # match one or more word characters
(?= # begin a positive lookahead
[}| ] # match one of the indicated characters
) # end positive lookahead
| # or
\( # match a literal '('
[\w ]+ # match one or more word characters or spaces
\) # match a literal ')'
Note that this does not validate the format of the string.
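With the match-based expression above, re.findall returns the substrings directly, with no capture groups involved. A short sketch on the two strings from the question follows; the names pattern, strings and found are illustrative.

```python
import re

pattern = re.compile(r'(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)')

strings = [
    "{{word1|word2|word3 (word4 word5)|word6}}",
    "{{word1|word2|word3|word6}}",
]

found = [pattern.findall(s) for s in strings]
for items in found:
    print(items)
# ['word1', 'word2', 'word3', '(word4 word5)', 'word6']
# ['word1', 'word2', 'word3', 'word6']
```

Unlike the group-based version, the position of an item in the list does not tell you which field it came from when the parenthesized part is missing.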

regex pattern not matching continuous groups

I am trying the following pattern:
[,;\" ](.+?\/.+?)[\",; ]
on the following string:
['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']
I want to extract text/html, application/xhtml+xml and application/xml. It extracts the first and third but not the middle one. Why?

Your last [,"; ] consumes the , after text/html and thus, at the next iteration, when the regex engine searches for a match, the first [,;" ] cannot match that comma. Hence, you lose one match.
You may turn the trailing [,"; ] into a non-consuming pattern, a positive lookahead, or better, since the matches cannot contain the delimiters, use a negated character class approach:
[,;" ]([^/,;" ]+/[^/,;" ]+)
See the regex demo. If there can be more than 1 / inside the expected matches, remove / char from the second character class.
Details
[,;" ] - a comma, ;, ", or space
([^/,;" ]+/[^/,;" ]+) - Group 1: one or more chars other than /, a comma, ;, " and space, as many as possible, then a / char, and then again one or more chars other than /, a comma, ;, " and space, as many as possible
Python demo:
import re
rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""
res = re.findall(rx, s)
print(res) # => ['text/html', 'application/xhtml+xml', 'application/xml']
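The other option mentioned in this answer, turning the trailing delimiter into a positive lookahead, can be compared with the question's original pattern. This is a sketch; the names consuming and lookahead are chosen here for illustration.

```python
import re

s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

# Original pattern: the trailing class consumes the delimiter,
# so the next match cannot start at that same delimiter.
consuming = re.findall(r'[,;" ](.+?/.+?)[",; ]', s)

# Trailing delimiter as a lookahead: it is checked but not consumed.
lookahead = re.findall(r'[,;" ](.+?/.+?)(?=[",; ])', s)

print(consuming)  # ['text/html', 'application/xml']
print(lookahead)  # ['text/html', 'application/xhtml+xml', 'application/xml']
```

The negated character class version is still the stricter choice, since .+? can run across delimiters when no closer match is available.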

Python regex: get text between two tags across new lines

I'm new to regex. Here is my data.
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
I want to get this.
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Here is my regex.
(<p>\[tag(.*)\])(.+)(\[\/tag\]<\/p>)
But it doesn't work because of the newlines (\n). If I use re.DOTALL, it works, but if my data has multiple records like
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
re.findall() returns only one match. In short, I want this:
[data1, data2, data3, ...]. What can I do?

Simple as this:
\](.*?)\[
reobj = re.compile(r"\](.*?)\[", re.IGNORECASE | re.DOTALL | re.MULTILINE)
result = reobj.findall(YOURSTRING)
Output:
y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38
Regex Explanation:
\] matches the character ] literally
1st Capturing group (.*?)
.*? matches any character
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\[ matches the character [ literally
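s modifier: single line. Dot matches newline characters
Note that with several records this pattern also returns the text between one record's closing [/tag] and the next record's opening [tag] (here </p>, a newline and <p>), so those entries need to be filtered out, or the tag names included in the pattern.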

You can use this regex:
\[tag\]([\s\S]*?)\[\/tag\]
Match information:
MATCH 1
1. [8-44] `y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38`
Update: explanation of the pattern
\[tag\]
([\s\S]*?) --> [\s\S]*? is used to match everything, since \S matches
all non-whitespace characters and \s matches whitespace. This is just a trick; you can
also use [\D\d] or [\W\w]. By the way, *? is just a lazy (non-greedy) quantifier
\[\/tag\]
On the other hand, if you want to allow attributes in the tag you can use:
\[tag.*?\]([\s\S]*?)\[\/tag\]
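For the multi-record case from the question, the \[tag\]([\s\S]*?)\[\/tag\] pattern can be passed to re.findall to get one entry per record. A minimal sketch follows, with the sample data copied from the question and the names data and records chosen for illustration.

```python
import re

data = """<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>
<p>[tag]y,m,m,l
1997,f,e,2.34g
2000,m,c,2.38[/tag]</p>"""

# The lazy quantifier stops at the first [/tag], so each record
# becomes its own match instead of one match spanning everything.
records = re.findall(r"\[tag\]([\s\S]*?)\[/tag\]", data)

print(len(records))  # 2
print(records[0])
```

Using (.*?) together with flags=re.DOTALL would behave the same way here; [\s\S] simply avoids the need for the flag.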

Python regex matching all but last occurrence

So I have expression such as "./folder/thisisa.test/file.cxx.h" How do I substitute/remove all the "." but the last dot?

To match all but the last dot with a regex:
'\.(?=[^.]*\.)'
This uses a lookahead to check that there's another dot after the one we found (the lookahead is not part of the match).
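Applied with re.sub to the path from the question, the lookahead pattern removes every dot except the last one. A short sketch, with the variable names s and result chosen for illustration:

```python
import re

s = "./folder/thisisa.test/file.cxx.h"

# A dot is removed only if another dot appears somewhere after it
result = re.sub(r'\.(?=[^.]*\.)', '', s)
print(result)  # /folder/thisisatest/filecxx.h
```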

Without regular expressions, using str.count and str.replace:
s = "./folder/thisisa.test/file.cxx.h"
s.replace('.', '', s.count('.')-1)
# '/folder/thisisatest/filecxx.h'

Specific one-char solution
In your current scenario, you may use
text = re.sub(r'\.(?![^.]*$)', '', text)
Here, \.(?![^.]*$) matches a . (with \.) that is not immediately followed ((?!...)) with any 0+ chars other than . (see [^.]*) up to the end of the string ($).
See the regex demo and the Python demo.
Generic solution for 1+ chars
If you want to handle . and other characters in the same way, you may use a capturing group around a character class with the chars you need to match, and add a positive lookahead with .* and a backreference to the captured value.
Say you need to remove all but the last occurrence of each of [, ], ^, \, /, - and .; you may use
([][^\\./-])(?=.*\1)
See the regex demo.
Details
([][^\\./-]) - a capturing group matching ], [, ^, \, ., /, - (note the order of these chars is important: - must be at the end, ] must be at the start, ^ should not be at the start and \ must be escaped)
(?=.*\1) - a positive lookahead that requires any 0+ chars as many as possible and then the value captured in Group 1.
Python sample code:
import re
text = r"./[\folder]/this-is-a.test/fi^le.cxx.LAST[]^\/-.h"
text = re.sub(r'([][^\\./-])(?=.*\1)', '', text, flags=re.S)
print(text)
Mind the r prefix on the string literals. Note that flags=re.S makes . match line break characters as well.
