Get words in parenthesis as a group regex - python

String1: {{word1|word2|word3 (word4 word5)|word6}}
String2: {{word1|word2|word3|word6}}
With this regex:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?=\}\})
I can capture the parts of String2 as groups. How can I change the regex so that (word4 word5) in String1 is also captured as a group?

You can add a (?:\s*(\([^()]*\)))? subpattern:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
The (?:\s*(\([^()]*\)))? part is an optional non-capturing group that matches one or zero occurrences of:
\s* - zero or more whitespace characters
( - start of a capturing group:
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
) - end of the group.
If you need to make sure only whitespace-separated words are allowed inside the parentheses, replace [^()]* with \w+(?:\s+\w+)*, so that the optional subpattern becomes (?:\s*(\(\w+(?:\s+\w+)*\)))?:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\(\w+(?:\s+\w+)*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
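As a quick check, the first pattern above can be run against both sample strings from the question. This is only a minimal sketch; the names pattern, string1, string2, groups1 and groups2 are illustrative and not part of the original answer.

```python
import re

# First pattern from the answer above: four word groups plus an
# optional parenthesized group after the third word group.
pattern = re.compile(
    r"(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)"
    r"(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})"
)

string1 = "{{word1|word2|word3 (word4 word5)|word6}}"
string2 = "{{word1|word2|word3|word6}}"

groups1 = pattern.search(string1).groups()
groups2 = pattern.search(string2).groups()

print(groups1)  # ('word1', 'word2', 'word3', '(word4 word5)', 'word6')
print(groups2)  # ('word1', 'word2', 'word3', None, 'word6')
```

Group 4 is None when the parenthesized part is absent, so code that consumes the groups should probably test for that before using it.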

You could simplify the expression by matching the desired substrings rather than capturing them. For that you could use the following regular expression.
(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)
The elements of the expression are as follows.
(?<= # begin a positive lookbehind
[{| ] # match one of the indicated characters
) # end the positive lookbehind
\w+ # match one or more word characters
(?= # begin a positive lookahead
[}| ] # match one of the indicated characters
) # end positive lookahead
| # or
\( # match a literal '('
[\w ]+ # match one or more word characters or spaces
\) # match a literal ')'
Note that this does not validate the format of the string.
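A short sketch of how this matching approach might be used with re.findall, assuming the two sample strings from the question. The variable names are illustrative.

```python
import re

# Match each word (delimited by {, | , } or a space) or a whole
# parenthesized run of words, instead of capturing numbered groups.
pattern = re.compile(r"(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)")

string1 = "{{word1|word2|word3 (word4 word5)|word6}}"
string2 = "{{word1|word2|word3|word6}}"

matches1 = pattern.findall(string1)
matches2 = pattern.findall(string2)

print(matches1)  # ['word1', 'word2', 'word3', '(word4 word5)', 'word6']
print(matches2)  # ['word1', 'word2', 'word3', 'word6']
```

Unlike the group-based patterns, the result lists differ in length, so the position of the parenthesized part is not fixed.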

Related

Optional group except when it precede with a match

I want to match any string that starts with a . followed by a word, optionally followed by whitespace and any further characters.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want the second group to be mandatory if the first group is jargon.

In Python, you can use a negative lookahead to rule out a string that consists only of .jargon, and then match 1 or more word characters.
After that, optionally match 1 or more whitespace characters other than a newline, followed by 1 or more characters other than a newline.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exclude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
Example
import re
strings = [
    ".just one two",
    ".just",
    ".blah one#nine",
    ".blah",
    ".jargon blah",
    ".jargon"
]
for s in strings:
    # re.match already anchors at the start of the string, so ^ is omitted
    m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
    if m:
        print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah

One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
import re
inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", ".jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']

You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
If successful, this will match the entire string. If you only want to know whether the string conforms to the requirements, the trailing .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the string of word characters is followed by a space followed by any character or they terminate the string.
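A small sketch that runs this expression against the sample strings from the question. The list name samples and the result name matched are illustrative.

```python
import re

pattern = re.compile(r"^\.(?!jargon$)\w+(?= .|$).*")

samples = [
    ".just one two",
    ".just",
    ".blah one#nine",
    ".blah",
    ".jargon blah",
    ".jargon",
]

# Keep only the strings the pattern accepts.
matched = [s for s in samples if pattern.match(s)]
print(matched)
# ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']
```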

What does this regex pattern match?

regex = re.compile(r"\s*[-*+]\s*(.+)")
Especially this part: \s*[-*+]
I want to match this string:
[John](person)is good and [Mary](person) is good too.
But it fails.
Does the \s*[-*+] mean the following:
matches an optional space, followed by one of the characters: -, *, +
This is in Python.

Pattern \s*[-*+]\s*(.+) means:
\s* - match zero or more whitespace characters
[-*+] - match one character from the set: - or * or +
\s* - again match zero or more whitespace characters
(.+) - match one or more of any character and store it in a capturing group (. means any character except a newline, and the parentheses denote a capturing group)
In your sentence, the pattern won't match anything because none of the characters -, * or + occurs in it.
It would match, for example * (person) is good too. in
[John](person)is good and [Mary] * (person) is good too.
In order to match names and their description in brackets use \[([^\]]+)\]\(([^)]+)
Explanation:
\[ - match [ literally
([^\]]+) - match one or more characters other than ] and store them in the first capturing group
\] - match ] literally
\( - match ( literally
([^)]+) - match one or more characters other than ) and store them in the second capturing group
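The difference between the two patterns can be checked with a short sketch, using the sentence from the question. The variable names are illustrative.

```python
import re

sentence = "[John](person)is good and [Mary](person) is good too."

# The original pattern requires one of -, * or +, which the sentence lacks.
bullet = re.compile(r"\s*[-*+]\s*(.+)")
no_match = bullet.search(sentence)
print(no_match)  # None

# The bracket/parenthesis pattern yields (name, description) pairs.
pairs = re.findall(r"\[([^\]]+)\]\(([^)]+)", sentence)
print(pairs)  # [('John', 'person'), ('Mary', 'person')]
```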

Regex complete words pattern

I want to get patterns involving complete words, not pieces of words.
E.g. 12345 [some word] 1234567 [some word] 123 1679. Random text and the pattern appears again 1111 123 [word] 555.
This should return
[[12345, 1234567, 123, 1679],[1111, 123, 555]]
I am only tolerating one word between the numbers otherwise the whole string would match.
Also note that it is important to capture that 2 matches were found and so a two-element list was returned.
I am running this in python3.
I have tried:
\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b
but I am not sure how to scale this to an unrestricted number of matches.
re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', string)
This matches [number] [word] [number] but not any number that might follow with or without a word in between.

Are you expecting re.findall() to return a list of lists? It only returns a flat list (of strings, or of tuples when the pattern has several groups), no matter what regex you use.
One approach is to split your input string into sentences and then loop through them
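import re
# '<pattern>' is a placeholder for a sentence-splitting regex; inputText is your input string
inputArray = re.split('<pattern>', inputText)
outputArray = []
for item in inputArray:
    # raw string, so that \b means a word boundary and not a backspace
    outputArray.append(re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', item))
The trick is to find a <pattern> that splits your input into sentences.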

You can't do this in one operation with the Python re engine.
But you could match the sequence with one match, then extract the
digits with another.
This matches the sequence
r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
https://regex101.com/r/73AYLU/1
Explained
(?<! \w ) # Not a word behind
\d+ # Many digits
(?: # Optional word block
(?: # Optional words
[^\S\r\n]+ # Horizontal whitespace
[a-zA-Z] # Starts with a letter
(?: \w* [a-zA-Z] )* # Can be digits in middle, ends with a letter
)? # End words, do once
[^\S\r\n]+ # Horizontal whitespace
\d+ # Many digits
)* # End word block, do many times
(?! \w ) # Not a word ahead
This gets the array of digits from the sequence matched above (use findall)
r"(?<!\S)(\d+)(?!\S)"
https://regex101.com/r/BHov38/1
Explained
(?<! \S ) # Whitespace boundary
( \d+ ) # (1)
(?! \S ) # Whitespace boundary
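The two steps can be combined roughly as follows. This is a sketch under assumptions: the sample text replaces the question's [some word] placeholders with concrete words (apple, pear, kiwi), the names sequence, number and result are illustrative, and the digits are converted to int to mirror the output requested in the question.

```python
import re

# Step 1: a run of numbers separated by horizontal whitespace,
# with at most one word allowed between two numbers.
sequence = re.compile(
    r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
)
# Step 2: the whitespace-delimited numbers inside one matched run.
number = re.compile(r"(?<!\S)(\d+)(?!\S)")

text = ("12345 apple 1234567 pear 123 1679. Random text and the pattern "
        "appears again 1111 123 kiwi 555.")

# One inner list per matched sequence.
result = [
    [int(n) for n in number.findall(m.group())]
    for m in sequence.finditer(text)
]
print(result)  # [[12345, 1234567, 123, 1679], [1111, 123, 555]]
```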

This is a bit complicated, maybe this expression would be just something to look into:
(((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-Za-z\s]+)
and script the rest of the problem for a valid solution.

Match word boundary before non-alphanumerical character

I want to find words starting with a single non-alphanumerical character, say '$', in a string with re.findall
Example of matching words
$Python
$foo
$any_word123
Example of non-matching words
$$Python
foo
foo$bar
Why \b does not work
If the first character were to be alphanumerical, I could do this.
re.findall(r'\bA\w+', s)
But this does not work for a pattern like \b\$\w+ because \b matches the empty string only between a \w and a \W.
# The line below matches only the last '$baz' which is the one that should not be matched
re.findall(r'\b\$\w+', '$foo $bar x$baz')
The above outputs ['$baz'], but the desired pattern should output ['$foo', '$bar'].
I tried replacing \b by a positive lookbehind with pattern ^|\s, but this does not work because lookbehinds in Python's re module must have a fixed width.
What is the correct way to handle this pattern?

The following will match a word starting with a single non-alphanumerical character.
re.findall(r'''
(?: # start non-capturing group
^ # start of string
| # or
\s # space character
) # end non-capturing group
( # start capturing group
[^\w\s] # character that is not a word or space character
\w+ # one or more word characters
) # end capturing group
''', s, re.X)
or just:
re.findall(r'(?:^|\s)([^\w\s]\w+)', s)
results in:
'$a $b a$c $$d' -> ['$a', '$b']

One way is to use a negative lookbehind with the non-whitespace metacharacter \S.
s = '$Python $foo foo$bar baz'
re.findall(r'(?<!\S)\$\w+', s) # output: ['$Python', '$foo']
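To see how the lookbehind handles the matching and non-matching examples from the question, here is a small sketch. The names words and found are illustrative.

```python
import re

# Matching examples: $Python, $foo, $any_word123
# Non-matching examples: $$Python, foo, foo$bar
words = "$Python $foo $any_word123 $$Python foo foo$bar"

# (?<!\S) succeeds at the start of the string or right after whitespace.
found = re.findall(r"(?<!\S)\$\w+", words)
print(found)  # ['$Python', '$foo', '$any_word123']
```

The negative lookbehind works where (?<=^|\s) does not because (?<!\S) has a fixed width of one character yet is also satisfied at the very start of the string.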

Python regex matching all but last occurrence

So I have expression such as "./folder/thisisa.test/file.cxx.h" How do I substitute/remove all the "." but the last dot?

To match all but the last dot with a regex:
r'\.(?=[^.]*\.)'
This uses a lookahead to check that there's another dot after the one we found (the lookahead is not part of the match).
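Applied with re.sub to the path from the question, the pattern could be used like this. A minimal sketch; path and cleaned are illustrative names.

```python
import re

path = "./folder/thisisa.test/file.cxx.h"

# Remove every dot that still has another dot somewhere after it.
cleaned = re.sub(r"\.(?=[^.]*\.)", "", path)
print(cleaned)  # /folder/thisisatest/filecxx.h
```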

Without regular expressions, using str.count and str.replace:
s = "./folder/thisisa.test/file.cxx.h"
s.replace('.', '', s.count('.')-1)
# '/folder/thisisatest/filecxx.h'

Specific one-char solution
In your current scenario, you may use
text = re.sub(r'\.(?![^.]*$)', '', text)
Here, \.(?![^.]*$) matches a . (with \.) that is not followed (the negative lookahead (?!...)) by 0+ chars other than . (see [^.]*) running up to the end of the string ($). In other words, it matches every dot that has another dot after it.
Generic solution for 1+ chars
In case you want to treat . and some other chars the same way, you may use a capturing group around a character class with the chars you need to match, and add a positive lookahead with .* and a backreference to the captured value.
Say you need to remove all but the last occurrence of each of [, ], ^, \, /, - and .; then you may use
([][^\\./-])(?=.*\1)
Details
([][^\\./-]) - a capturing group matching ], [, ^, \, ., /, - (note the order of these chars is important: - must be at the end, ] must be at the start, ^ should not be at the start and \ must be escaped)
(?=.*\1) - a positive lookahead that requires any 0+ chars as many as possible and then the value captured in Group 1.
Python sample code:
import re
text = r"./[\folder]/this-is-a.test/fi^le.cxx.LAST[]^\/-.h"
text = re.sub(r'([][^\\./-])(?=.*\1)', '', text, flags=re.S)
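print(text) # folderthisisatestfilecxxLAST[]^\/-.h
Mind the r prefix on the string literals. Note that flags=re.S makes . match line break characters as well, so the lookahead can also find a later occurrence on another line.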
