How do I "not" select first character in a regex pattern? - python

I am a RegEx beginner and trying to identify the endings of different statements in sms. See screenshot below.
How can I avoid selecting the letter that follows the full-stop marking the end of a statement?
Note that some statements have <.><Alphabets> while some have <.><space><Alphabets>
Regex used: r"\. ?[\D]"
Sample SMS: - I want to select just the full-stop and space if any.
Txn of USD 00.00 done using TC XX at POS*MERCH on 30-Feb-22. Avl bal:USD 00.00. Call xxxxxx for dispute or SMS BLOCK xxxx to xxxxxxx
Acct XX debited with USD XX.00 on some date.Info: ABC*BDECS-XYZ.Avbl Bal:USD yy,xxx.95.Call xxxxxx for dispute or SMS BLOCK xx to xxxxx
screenshot from RegExr on regular pattern

What you're looking for is a look-ahead group. You can use either a positive look-ahead with the negated character set \D or a negative look-ahead with the character set \d. The two differ only at the very end of the string, where the negative form still matches because no digit follows. I'll outline both below (note that the dot must be escaped to match a literal full-stop):
regex = r"\. ?(?=\D)" # asserts that the following character matches \D
regex = r"\. ?(?!\d)" # asserts that the following character does NOT match \d
There are also look-behind variants, (?<!pattern) and (?<=pattern), which assert that the pattern does not or does match just before the current position.
None of these groups consume or capture text; they just "look ahead" or "look behind" without moving the match position.
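As a minimal sketch of the positive look-ahead in use, the snippet below splits the second sample SMS from the question on its statement endings, with the dot escaped so that it matches a literal full-stop. The variable names are illustrative and not taken from the original post.

```python
import re

# Second sample SMS from the question (statements separated by "." with no space).
sms = ("Acct XX debited with USD XX.00 on some date.Info: ABC*BDECS-XYZ."
       "Avbl Bal:USD yy,xxx.95.Call xxxxxx for dispute or SMS BLOCK xx to xxxxx")

# A literal dot, an optional space, then a look-ahead requiring a non-digit.
# The look-ahead inspects the next character without consuming it.
ending = re.compile(r"\. ?(?=\D)")

statements = ending.split(sms)
for statement in statements:
    print(statement)
```

Because the look-ahead consumes nothing, the first letter of each following statement stays in the output, while the dots inside amounts such as XX.00 are left alone.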

The pattern \. ?[\D] matches a single non-digit character after the dot, and that character can itself be a space or a newline.
If you want to match only the dot, but not when it is the last character in the string, you can use a look-ahead that first asserts optional whitespace without newlines.
The look-ahead then requires a non-whitespace character that is not a digit.
\.(?=[^\S\n]*[^\s\d])
The pattern matches:
\. Match a dot
(?= Positive lookahead to assert what is directly to the right of the current position is
[^\S\n]* Match optional whitespace chars without a newline
[^\s\d] Match a single non whitespace char other than a digit
) Close lookahead
See a regex demo.
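A short sketch of how this pattern behaves in Python, assuming a made-up test string: only the dot that ends a statement and is followed by more text gets replaced, while the decimal dot and the final dot are left alone.

```python
import re

# Dot followed (in a look-ahead) by optional horizontal whitespace and then
# a character that is neither whitespace nor a digit.
pattern = re.compile(r"\.(?=[^\S\n]*[^\s\d])")

text = "Bal:USD 00.00. Call now."   # illustrative input, not from the question
marked = pattern.sub("|", text)      # mark each matched dot with "|"
print(marked)
```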

Related

Exclude words with pattern 'xyyx' but include words that start & ends with same letter

I have a regex to match words that starts and ends with the same letter (excluding single characters like 'a', '1' )
(^.).*\1$
and another regex to avoid matching any strings with the format 'xyyx' (e.g 'otto', 'trillion', 'xxxx', '-[[-', 'fitting')
^(?!.*(.)(.)\2\1)
How do I construct a single regex to meet both of the requirements?
You can start the pattern with the negative lookahead, followed by the pattern for the match. Note that the backreference in the last part must change to \3, because the lookahead already uses group 1 and group 2.
Note that the . also matches a space, so if you don't want to match spaces you can use \S to match non whitespace chars instead.
^(?!.*(.)(.)\2\1)(.).*\3$
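To try the combined pattern in Python, a sketch along these lines could be used. The word list is made up for illustration.

```python
import re

# Negative lookahead rejects any 'xyyx' sequence; group 3 then captures the
# first character, which must reappear at the end of the string.
combined = re.compile(r"^(?!.*(.)(.)\2\1)(.).*\3$")

words = ["area", "level", "otto", "noon", "abba", "a", "fitting"]
kept = [w for w in words if combined.search(w)]
print(kept)
```

Only "area" and "level" survive here: "otto", "noon" and "abba" contain an 'xyyx' sequence, "a" is a single character, and "fitting" neither starts and ends with the same letter nor avoids the 'itti' sequence.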
I would place the negative look-ahead after the initial character, and let it exclude the final character (as those two should be part of a positive capture):
^(.)(?!.*(.)\2.).*\1$
Note that the negative check concerns characters between the start and ending character, and so these words would not be rejected:
oopso
livewell

Optional group except when it precede with a match

I want to match any string that starts with . and word and then optionally any character after a space.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want this second group mandatory if first group is jargon
Using Python, you can rule out jargon as the only word with a negative lookahead, and then match 1 or more word characters.
Then optionally match 1 or more whitespace characters (excluding newlines) followed by 1 or more characters other than newlines.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exclude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
See a regex demo and a Python demo.
Example
import re
strings = [
    ".just one two",
    ".just",
    ".blah one#nine",
    ".blah",
    ".jargon blah",
    ".jargon"
]
for s in strings:
    m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
    if m:
        print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah
One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
import re
inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", ".jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']
You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
If successful, this will match the entire string. If one simply wants to know if the string conforms to the requirements .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the string of word characters is followed by a space followed by any character or they terminate the string.
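A minimal Python check of this expression against the sample strings from the question might look like the following. The variable names are illustrative.

```python
import re

pattern = re.compile(r"^\.(?!jargon$)\w+(?= .|$).*")

strings = [
    ".just one two",
    ".just",
    ".blah one#nine",
    ".blah",
    ".jargon blah",
    ".jargon",
]

# re.match anchors at the start; the lookaheads decide whether the rest is acceptable.
matched = [s for s in strings if pattern.match(s)]
print(matched)
```

Only ".jargon" is rejected, because the negative lookahead sees jargon immediately followed by the end of the string.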

Replace one part of a pattern in a string/sentence?

There is a text blob for example
"Text blob1. Text blob2. Text blob3 45.6%. Text blob4."
I want to replace the dots i.e. "." with space " ". But at the same time, dots appearing between numbers should be retained. For example, the previous example should be converted to:
"Text blob1 Text blob2 Text blob3 45.6% Text blob4"
If I use:
p = re.compile(r'\.')
s = p.sub(' ', s)
It replaces all dots with space.
Any suggestions on what pattern or method works here?
Use
\.(?!(?<=\d\.)\d)
See proof. This expression matches any dot except one that is both preceded and followed by a digit.
EXPLANATION
\. a literal '.'
(?! look ahead to see if there is not:
(?<= look behind to see if there is:
\d a digit (0-9)
\. a literal '.'
) end of look-behind
\d a digit (0-9)
) end of look-ahead
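As a sketch of how this could be applied to the text from the question: replacing each matched dot with a space leaves doubled spaces behind, so the example also collapses whitespace afterwards. That clean-up step is an addition here, not part of the pattern.

```python
import re

s = "Text blob1. Text blob2. Text blob3 45.6%. Text blob4."

# Replace every dot except those sitting directly between two digits.
replaced = re.sub(r"\.(?!(?<=\d\.)\d)", " ", s)

# Collapse the runs of spaces left behind and trim the ends.
cleaned = " ".join(replaced.split())
print(cleaned)
```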
You might not need regex here. Replace a dot-space with a space.
s.replace('. ', ' ')
That isn't good enough if you have any periods followed by a newline or that terminate the string, but you still wouldn't need a regex:
s.replace('. ', ' ').replace('.\n', '\n').rstrip('.')
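A quick sketch of the chained replacements on the question's text, plus a made-up two-line string to exercise the newline case:

```python
def strip_sentence_dots(s):
    """Drop dots that end a sentence; dots inside numbers are untouched."""
    return s.replace(". ", " ").replace(".\n", "\n").rstrip(".")

single = strip_sentence_dots("Text blob1. Text blob2. Text blob3 45.6%. Text blob4.")
multi = strip_sentence_dots("First line.\nSecond 3.5 value. End.")  # illustrative input

print(single)
print(multi)
```

The helper name strip_sentence_dots is hypothetical. The approach relies on sentence-ending dots always being followed by a space, a newline, or the end of the string.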
Suppose the string were
A.B.C blob3 45.6%. Text blob4.
Match all periods other than those both preceded and followed by a digit
If after replacements, the string
ABC blob3 45.6% Text blob4
were desired, one could use re.sub with the regular expression
r'(?<!\d)\.|\.(?!\d)'
to replace matches of periods with empty strings.
The regex reads, "match a period that is not preceded by a digit or is not followed by a digit".
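The negative lookarounds are employed so that a period at the beginning or end of the string is also matched. One could instead use positive lookarounds, but Python's re module only accepts fixed-width lookbehinds and rejects (?<=^|\D), so the start-of-string case is written as a separate alternative:
r'(?:^|(?<=\D))\.|\.(?=\D|$)'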
Match all periods except those both preceded and followed by a non-whitespace character
On the other hand, if, after substitutions, the string
A.B.C blob3 45.6% Text blob4
were desired one could use re.sub with the regular expression
r'(?<!\S)\.|\.(?!\S)'
to replace matches of periods with empty strings.
This regex reads, "match a period that is not preceded by a character other than a whitespace or is not followed by a character other than a whitespace".
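One could instead use the logical equivalent, with the start-of-string case split out because Python's re module rejects the variable-width lookbehind (?<=^|\s):
r'(?:^|(?<=\s))\.|\.(?=\s|$)'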
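To compare the two negative-lookaround patterns side by side, a small sketch such as the following can be run on the sample string. The variable names are illustrative, and matches are replaced with empty strings as described in the text.

```python
import re

s = "A.B.C blob3 45.6%. Text blob4."

# Remove periods unless they have a digit on both sides.
not_between_digits = re.sub(r"(?<!\d)\.|\.(?!\d)", "", s)

# Remove periods that touch whitespace or a string boundary on at least one side.
next_to_whitespace = re.sub(r"(?<!\S)\.|\.(?!\S)", "", s)

print(not_between_digits)
print(next_to_whitespace)
```

Both calls keep the decimal point in 45.6; they differ only in how they treat the periods in A.B.C.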

Python Regex is not working for the given pattern

I'm trying to extract a pattern from string using python regex. But it is not working with the below pattern
headerRegex = re.compile(r'^[^ ]*\s+\d*')
mo = headerRegex.search(string)
return mo.group()
My requirement is a regular expression that starts with anything except whitespace, followed by one or more whitespace characters, then one or more digits
Example
i/p: test 7895 ==> o/p: 7895 (correct)
i/p: 8545 ==> Not matching
i/p: #### 3453 ==> 3453
May I know what is missing in my regex to implement this requirement?
In the pattern that you tried, only the whitespace is mandatory, and \s+ can also match newlines on their own.
Change the quantifiers to + to match 1 or more times, and if you don't want to match newlines, use [^\S\r\n]+ instead.
If only that exact match is allowed, add the anchor $ to assert the end of the string, or use \Z if a trailing newline is not allowed.
^\S+[^\S\r\n]+\d+$
^ Start of string
\S+ Match 1+ times a non whitespace char
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
\d+ Match 1+ digits
$ End of string
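A sketch of how this could be used on the inputs from the question. The capturing group around \d+ and the helper name extract_digits are additions for illustration (the question expects only the digits back), and the None check avoids calling .group() on a failed search.

```python
import re

# Same shape as the pattern above, with a hypothetical capturing group around \d+.
header_regex = re.compile(r"^\S+[^\S\r\n]+(\d+)$")

def extract_digits(line):
    """Return the trailing digit run, or None when the line does not fit the shape."""
    mo = header_regex.search(line)
    return mo.group(1) if mo else None

for line in ["test 7895", "8545", "#### 3453"]:
    print(repr(line), "->", extract_digits(line))
```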

Regex matching digit in between?

I would like to get number in between these strings.
strings = ["point_right: account ISLAMIC: 860328 9221 asdsad",
"account 723123123",
"account823123213",
"account 823.123.213",
"account 823-123-213",
"account:123213123 ",
"account: 123213123 asdasdsad 017-299906",
"account: 123213123",
"point_right: account ISLAMIC: 860328 9221"
]
Result would be
[860328 9221,723123123, 823123213, 823.123.213, 823-123-213, 123213123, 123213123, 123213123]
And i can do processing later to make them into number. So far my strategy is to get everything after pattern and anything before a letter. I have tried:
for string in strings:
    print(re.findall("(?<=account)(.*)", string.lower()))
Please help to give some pointers on the regex match.
Try this pattern:
(?=[^0-9]*)[0-9][0-9 .-]*[0-9]
Breakdown:
(?=[^0-9]*) Lookahead for zero or more non-digits; because it can match the empty string it always succeeds and does not restrict the match
[0-9] Find a digit
[0-9 .-]* Find any number of digits or special characters (in your strings you have spaces, dashes, periods so I included those)
[0-9] Find another digit (to prevent spaces at the end)
Check it out here, and sample code here
(?!\W)([\d\s.-]+)(?<!\s)
The negative lookahead and lookbehind seem like overkill here, but I wasn't able to get a clean match otherwise. You may see the results here
(?!\W) Negative lookahead to exclude any non-word characters [^a-zA-Z0-9_]
([\d\s.-]+) The capturing group for your numbers
(?<!\s) Negative lookbehind to exclude whitespace characters [\r\n\t\f\v ]
If the numbers must be the first numbers after the account substring use
re.findall(r"account\D*([\d\s.-]*\d)", s)
See the Python demo and the regex demo.
Pattern details
account - a literal substring
\D* - 0+ chars other than digits
([\d\s.-]*\d) - Capturing group 1 (the value returned by re.findall): 0 or more digits, whitespaces, . and - chars followed with a digit.
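Applied to the list from the question, a sketch could look like this. It takes only the first match per string and lowercases the input as the question's own attempt does; both choices are assumptions about the intended use.

```python
import re

strings = [
    "point_right: account ISLAMIC: 860328 9221 asdsad",
    "account 723123123",
    "account823123213",
    "account 823.123.213",
    "account 823-123-213",
    "account:123213123 ",
    "account: 123213123 asdasdsad 017-299906",
    "account: 123213123",
    "point_right: account ISLAMIC: 860328 9221",
]

account_number = re.compile(r"account\D*([\d\s.-]*\d)")

results = []
for s in strings:
    m = account_number.search(s.lower())
    # Group 1 always ends in a digit, so trailing spaces are never captured.
    results.append(m.group(1) if m else None)

print(results)
```

The second number in "account: 123213123 asdasdsad 017-299906" is ignored because it is not preceded by account.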
