I want to match any string that starts with a dot followed by a word, optionally followed by a space and then any characters.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want the second group to be mandatory if the first group is jargon.
In Python, you can use a negative lookahead to rule out jargon as the only word, and then match 1 or more word characters.
After that, optionally match 1 or more whitespace characters other than newlines, followed by 1 or more characters other than newlines.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exclude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
Example
import re
strings = [
".just one two",
".just",
".blah one#nine",
".blah",
".jargon blah",
".jargon"
]
for s in strings:
    m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
    if m:
        print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah
One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
import re
inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", ".jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']
You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
If successful, this will match the entire string. If one simply wants to know if the string conforms to the requirements .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the run of word characters is either followed by a space and then any character, or ends the string.
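A minimal sketch of how this lookahead pattern behaves on the sample strings from the question, with and without the trailing .*. The names FULL, CHECK_ONLY, samples and conforms are illustrative and not part of the original answer.

```python
import re

# Variant that matches the entire string, and the shorter
# variant (trailing .* dropped) that only checks conformance.
FULL = re.compile(r"^\.(?!jargon$)\w+(?= .|$).*")
CHECK_ONLY = re.compile(r"^\.(?!jargon$)\w+(?= .|$)")

samples = [".just one two", ".just", ".blah one#nine",
           ".blah", ".jargon blah", ".jargon"]


def conforms(s):
    """Hypothetical helper: True if s meets the requirement."""
    return CHECK_ONLY.search(s) is not None


# Full text of every string the first variant matches.
full_matches = [m.group() for m in map(FULL.match, samples) if m]
# Strings accepted by the check-only variant.
accepted = [s for s in samples if conforms(s)]

print(full_matches)
print(accepted)
```

Both lists should contain every sample except the bare ".jargon".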
Related
String1: {{word1|word2|word3 (word4 word5)|word6}}
String2: {{word1|word2|word3|word6}}
With this regex sentence:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?=\}\})
I capture String2 as groups. How can I change the regex sentence to capture (word4 word5) also as a group?
You can add a (?:\s*(\([^()]*\)))? subpattern:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
The (?:\s*(\([^()]*\)))? part is an optional non-capturing group that matches one or zero occurrences of
\s* - zero or more whitespaces
( - start of a capturing group:
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
) - end of the group.
If you need to make sure only whitespace separated words are allowed inside parentheses, replace [^()]* with \w+(?:\s+\w+)* and insert (?:\s*(\(\w+(?:\s+\w+)*\)))?:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\(\w+(?:\s+\w+)*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
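As a rough illustration, the stricter pattern can be run against both strings from the question to see what lands in each group. The variable names here are assumptions made for the example; group 4 is the optional parenthesised part.

```python
import re

# Stricter pattern from above, split over two lines for readability.
pattern = re.compile(
    r"(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)"
    r"(?:\s*(\(\w+(?:\s+\w+)*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})"
)

string1 = "{{word1|word2|word3 (word4 word5)|word6}}"
string2 = "{{word1|word2|word3|word6}}"

groups1 = pattern.search(string1).groups()
groups2 = pattern.search(string2).groups()

print(groups1)  # group 4 holds "(word4 word5)"
print(groups2)  # group 4 is None because the optional part is absent
```

When the parenthesised part is missing, the optional group does not participate in the match, so Python reports it as None rather than an empty string.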
You could simplify the expression by matching the desired substrings rather than capturing them. For that you could use the following regular expression.
(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)
The elements of the expression are as follows.
(?<= # begin a positive lookbehind
[{| ] # match one of the indicated characters
) # end the positive lookbehind
\w+ # match one or more word characters
(?= # begin a positive lookahead
[}| ] # match one of the indicated characters
) # end positive lookahead
| # or
\( # match character
[\w ]+ # match one or more of the indicated characters
\) # match character
Note that this does not validate the format of the string.
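A short sketch of the matching approach with re.findall, applied to the two strings from the question. The variable names are made up for the example.

```python
import re

# Match the wanted substrings directly instead of capturing them.
pattern = re.compile(r"(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)")

string1 = "{{word1|word2|word3 (word4 word5)|word6}}"
string2 = "{{word1|word2|word3|word6}}"

found1 = pattern.findall(string1)
found2 = pattern.findall(string2)

print(found1)
print(found2)
```

Because the pattern has no capture groups, findall returns the matched substrings themselves, in order of appearance.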
Data Set
Cider
631
Spruce
871
Honda
18813
Nissan
3292
Pine
10621
Walnut
10301
Code
#!/usr/bin/python
import re
text = "Cider\n631\n\nSpruce\n871\n\nHonda\n18813\n\nNissan\n3292\n\nPine\n10621\n\nWalnut\n10301\n\n"
f1 = re.findall(r"(Cider|Pine)\n(.*)",text)
print(f1)
Current Result
[('Cider', '631'), ('Pine', '10621')]
Question:
How do I change the regex to match everything except several specified strings, e.g. (Honda|Nissan)?
Desired Result
[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
You can exclude matching either of the names or only digits, and then match the 2 lines starting with at least a non whitespace char.
^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)
The pattern matches:
^ Start of string (or of each line with re.MULTILINE, as used in the code below)
(?! Negative lookahead, assert not directly to the right
(?:Honda|Nissan|\d+)$ Match any of the alternatives, followed by asserting the end of the line
) Close lookahead
(\S.*) Capture group 1, match a non whitespace char followed by the rest of the line
\n Match a newline
(.*) Capture group 2, match the rest of the line (any characters except a newline)
import re
text = ("Cider\n"
"631\n\n"
"Spruce\n"
"871\n\n"
"Honda\n"
"18813\n\n"
"Nissan\n"
"3292\n\n"
"Pine\n"
"10621\n\n"
"Walnut\n"
"10301")
f1 = re.findall(r"^(?!(?:Honda|Nissan|\d+)$)(\S.*)\n(.*)", text, re.MULTILINE)
print(f1)
Output
[('Cider', '631'), ('Spruce', '871'), ('Pine', '10621'), ('Walnut', '10301')]
If the line should start with an uppercase char A-Z and the next line should consist of only digits:
^(?!Honda|Nissan)([A-Z].*)\n(\d+)$
This pattern matches:
^ Start of string (or of each line with re.MULTILINE)
(?!Honda|Nissan) Negative lookahead, assert not Honda or Nissan directly to the right
([A-Z].*) Capture group 1, match an uppercase char A-Z followed by the rest of the line
\n Match a newline
(\d+) Capture group 2, match 1+ digits
$ End of string (or of each line with re.MULTILINE)
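A minimal sketch of this stricter pattern on the question's data. It assumes the re.MULTILINE flag, so that ^ and $ work per line, and the variable name pairs is illustrative.

```python
import re

text = ("Cider\n631\n\n"
        "Spruce\n871\n\n"
        "Honda\n18813\n\n"
        "Nissan\n3292\n\n"
        "Pine\n10621\n\n"
        "Walnut\n10301")

# A line starting with A-Z that is not Honda/Nissan, then a line of digits only.
pairs = re.findall(r"^(?!Honda|Nissan)([A-Z].*)\n(\d+)$", text, re.MULTILINE)
print(pairs)
```

Note that (?!Honda|Nissan) has no end-of-line assertion, so it also rejects names that merely begin with Honda or Nissan.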
One suggestion is to invert the match with the caret (^) symbol:
f1 = re.findall(r"(\s?^(Cider|Pine))\n(.*)",text)
Keep in mind, however, that the caret negates only when it is the first character inside a character class such as [^abc]. Everywhere else it is an anchor that asks "does the match start at the beginning of the string (or of a line, with re.MULTILINE)?".
Inserting a "filler" character such as an optional single space in front of the caret does not change that meaning, so the pattern above still only finds Cider or Pine at the start of the text. To get the inverse, use a negative lookahead such as (?!Honda|Nissan) instead.
I am trying to extract the name and profession as a list of tuples from the below string using regex.
Input string
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
As you can see, each name is followed by a profession, and the pairs repeat in a comma-separated fashion. The problem is that I want to get rid of the adjectives that come along with the profession, e.g. "amazing" in this example.
Expected output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
I stripped out the adjective from the text using "replace" and used the below code using "regex" to get the output. But I am looking for a single regex function to avoid running the string replace. I figured that this has something to do with look ahead in regex but couldn't make it work. Any help would be appreciated.
text = text.replace("amazing ", "")
txt_new = re.findall(r"([\w\s]+),([\w\s]+)", text)
If you only want to use word and whitespace characters, this could be another option:
(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)
Explanation
( Capture group 1
\w+(?:\s+\w+)* Match 1+ word chars and optionally repeat 1+ whitespace chars and 1+ word chars
) Close group 1
\s*,\s* Match a comma between optional whitespace chars
(?:\w+\s+)* Optionally repeat 1+ word and 1+ whitespace chars
(\w+) Capture group 2, match 1+ word chars
import re
regex = r"(\w+(?:\s+\w+)*)\s*,\s*(?:\w+\s+)*(\w+)"
s = ("Mr John,Carpenter,Mrs Liza,amazing painter")
print(re.findall(regex, s))
Output
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is one regex approach using re.findall:
import re
text = "Mr John,Carpenter,Mrs Liza,amazing painter"
matches = re.findall(r'\s*([^,]+?)\s*,\s*.*?(\S+)\s*(?![^,])', text)
print(matches)
This prints:
[('Mr John', 'Carpenter'), ('Mrs Liza', 'painter')]
Here is an explanation of the regex pattern:
\s* match optional whitespace
([^,]+?) match the name
\s* optional whitespace
, first comma
\s* optional whitespace
.*? consume all content up until
(\S+) the last profession word
\s* optional whitespace
(?![^,]) assert that what follows is either comma or the end of the input
I'm trying to extract a pattern from string using python regex. But it is not working with the below pattern
headerRegex = re.compile(r'^[^ ]*\s+\d*')
mo = headerRegex.search(string)
return mo.group()
My requirement is a regular expression that starts with anything except whitespace, followed by one or more whitespace characters, then one or more digits.
Example
i/p: test 7895 => o/p: 7895 (correct)
i/p: 8545 => not matching
i/p: #### 3453 => o/p: 3453
May I know what is missing in my regex to implement this requirement?
In the pattern that you tried, only the whitespace (\s+) is mandatory. Both [^ ]* and \d* can match nothing, and \s+ can also match newlines.
Change the quantifiers to + to match 1 or more times, and if you don't want to match newlines, use [^\S\r\n]+ instead of \s+.
If only that exact format is allowed, add the anchor $ to assert the end of the string, or use \Z if a trailing newline must not be allowed.
^\S+[^\S\r\n]+\d+$
^ Start of string
\S+ Match 1+ times a non whitespace char
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
\d+ Match 1+ digits
$ End of string
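A small sketch of the pattern applied to the three inputs from the question. The capture group around \d+ and the helper name extract_number are additions for illustration (the asker wants only the digits back); they are not part of the pattern above.

```python
import re

# Same pattern as above, with a capture group added around the digits.
HEADER = re.compile(r"^\S+[^\S\r\n]+(\d+)$")


def extract_number(line):
    """Hypothetical helper: return the digits, or None if the line does not match."""
    m = HEADER.search(line)
    return m.group(1) if m else None


results = {s: extract_number(s) for s in ["test 7895", "8545", "#### 3453"]}
print(results)
```

Returning None for a non-match avoids the AttributeError that calling .group() on a failed search would raise.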
I want to find words starting with a single non-alphanumerical character, say '$', in a string with re.findall
Example of matching words
$Python
$foo
$any_word123
Example of non-matching words
$$Python
foo
foo$bar
Why \b does not work
If the first character were to be alphanumerical, I could do this.
re.findall(r'\bA\w+', s)
But this does not work for a pattern like \b\$\w+ because \b matches the empty string only between a \w and a \W.
# The line below matches only the last '$baz' which is the one that should not be matched
re.findall(r'\b\$\w+', '$foo $bar x$baz')
The above outputs ['$baz'], but the desired pattern should output ['$foo', '$bar'].
I tried replacing \b by a positive lookbehind with pattern ^|\s, but this does not work because lookbehinds in Python's re module must be fixed in length.
What is the correct way to handle this pattern?
The following will match a word starting with a single non-alphanumerical character.
re.findall(r'''
(?: # start non-capturing group
^ # start of string
| # or
\s # whitespace character
) # end non-capturing group
( # start capturing group
[^\w\s] # character that is not a word or space character
\w+ # one or more word characters
) # end capturing group
''', s, re.X)
or just:
re.findall(r'(?:^|\s)([^\w\s]\w+)', s, re.X)
results in:
'$a $b a$c $$d' -> ['$a', '$b']
One way is to use a negative lookbehind with the non-whitespace metacharacter \S.
s = '$Python $foo foo$bar baz'
re.findall(r'(?<!\S)\$\w+', s) # output: ['$Python', '$foo']
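As a quick check, both approaches can be run against the matching and non-matching words listed in the question. The third pattern, which combines the negative lookbehind with [^\w\s] so that any single leading symbol is accepted, is an assumption added for illustration and does not come from either answer.

```python
import re

# Matching and non-matching example words from the question, in one string.
s = "$Python $foo $any_word123 $$Python foo foo$bar"

# Approach 1: non-capturing (?:^|\s) prefix plus a capture group.
with_group = re.findall(r"(?:^|\s)([^\w\s]\w+)", s)

# Approach 2: negative lookbehind, "not preceded by a non-whitespace char".
with_lookbehind = re.findall(r"(?<!\S)\$\w+", s)

# Assumed generalisation: lookbehind with any single leading symbol.
any_symbol = re.findall(r"(?<!\S)[^\w\s]\w+", "#tag $foo a#b")

print(with_group)
print(with_lookbehind)
print(any_symbol)
```

The lookbehind version consumes no whitespace, which makes it the easier one to reuse with re.sub or re.finditer when match positions matter.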