Replace one part of a pattern in a string/sentence? - python

There is a text blob for example
"Text blob1. Text blob2. Text blob3 45.6%. Text blob4."
I want to replace the dots i.e. "." with space " ". But at the same time, dots appearing between numbers should be retained. For example, the previous example should be converted to:
"Text blob1 Text blob2 Text blob3 45.6% Text blob4"
If I use:
p = re.compile('\.')
s = p.sub(' ', s)
It replaces all dots with space.
Any suggestions on what pattern or method works here?

Use
\.(?!(?<=\d\.)\d)
See proof. This expression will match any dot that has no digit after it that is preceded with a digit and a dot.
EXPLANATION
NODE EXPLANATION
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of look-ahead

You might not need regex here. Replace a dot-space with a space.
s.replace('. ', ' ')
That isn't good enough if you have any periods followed by a newline or that terminate the string, but you still wouldn't need a regex:
s.replace('. ', ' ').replace('.\n', '\n').rstrip('.')

Suppose the string were
A.B.C blob3 45.6%. Text blob4.
Match all periods other than those both preceded and followed by a digit
If after replacements, the string
A B C blob3 45.6% Text blob4
were desired, one could use re.sub with the regular expression
r'(?<!\d)\.|\.(?!\d)'
to replace matches of periods with empty strings.
The regex reads, "match a period that is not preceded by a character other than a digit or is not followed by a character other than a digit".
Demo 1
The double-negative is employed to match a period at the beginning or end of the string. One could instead use the logical equivalent:
r'(?<=^|\D)\.|\.(?=\D|$)'
Match all periods except those both preceded and followed by a whitespace character
On the other hand, if, after substitutions, the string
A.B.C blob3 45.6% Text blob4
were desired one could use re.sub with the regular expression
r'(?<!\S)\.|\.(?!\S)'
to replace matches of periods with empty strings.
This regex reads, "match a period that is not preceded by a character other than a whitespace or is not followed by a character other than a whitespace".
Demo 2
One could instead use the logical equivalent:
r'(?<=^|\s)\.|\.(?=\s|$)'

Related

Optional group except when it precede with a match

I want to match any string that starts with . and word and then optionally any character after a space.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want this second group mandatory if first group is jargon
Using Python you can exclude matching only jargon using a negative lookahead, and then match 1 or more word characters
Then optionally match 1 or more whitespace characters excluding newlines followed by at least 1 or more characters without newlines.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exlude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
See a regex demo and a Python demo.
Example
import re
strings = [
".just one two",
".just",
".blah one#nine",
".blah",
".jargon blah",
".jargon"
]
for s in strings:
m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
if m:
print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah
One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", "jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']
You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
Demo
If successful, this will match the entire string. If one simply wants to know if the string conforms to the requirements .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the string of word characters is followed by a space followed by any character or they terminate the string.

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. In the site regexr, the regex expression match the parts that I want but gives a warning, so... I'm not a expert in regex, I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
Regex demo
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
Python regex<¯\(ツ)/¯>Python code
This does not work, however, for string defined in the OP's code, which contains a question mark, which is contrary to the statement of the question in terms of the two examples.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of two space characters in a character class merely to make it visible.

How do I "not" select first character in a regex pattern?

I am a RegEx beginner and trying to identify the endings of different statements in sms. See screenshot below.
How can I avoid selecting the next letter following by a full-stop that indicates ending of a statement.
Note that some statements have <.><Alphabets> while some have <.><space><Alphabets>
Regex used: r"\. ?[\D]"
Sample SMS: - I want to select just the full-stop and space if any.
Txn of USD 00.00 done using TC XX at POS*MERCH on 30-Feb-22. Avl bal:USD 00.00. Call xxxxxx for dispute or SMS BLOCK xxxx to xxxxxxx
Acct XX debited with USD XX.00 on some date.Info: ABC*BDECS-XYZ.Avbl Bal:USD yy,xxx.95.Call xxxxxx for dispute or SMS BLOCK xx to xxxxx
screenshot from RegExr on regular pattern
What you're looking for is a look-ahead group. Whether you make that a positive look-ahead and use the negated character set \D or a negative look-ahead with the character set \d doesn't really matter- I'll outline both below:
regex = r". ?(?=\D)" # asserts that the following character matches \D
regex = r". ?(?!\d)" # asserts the following character does NOT match \d
There's also look-behind variants (?<!pattern) and (?<=pattern), which assert that the pattern doesn't/does match just before the current position.
None of these groups capture the matched text- they just "look ahead" or "look behind" without changing state.
Using \. ?[\D] is matching a single non digit char, but also that non digit char can be space or a newline by itself.
If you want to match a dot only, but not when it is the last character in the string, you can assert optional spaces without newlines.
Then match a non whitespace char not being a digit.
\.(?=[^\S\n]*[^\s\d])
The pattern matches:
\. Match a dot
(?= Positive lookahead to assert what is directly to the right of the current position is
[^\S\n]* Match optional whitespace chars without a newline
[^\s\d] Match a single non whitespace char other than a digit
) Close lookahead
See a regex demo.

Python Regex is not working for the given pattern

I'm trying to extract a pattern from string using python regex. But it is not working with the below pattern
headerRegex = re.compile(r'^[^ ]*\s+\d*')
mo = headerRegex.search(string)
return mo.group()
My requirment is regular expression that should start with anything except white space and followed by one or more whitespace then digits occurence one or more
Example
i/p: test 7895 => olp:7895(correct)
i/p: 8545 ==> Not matching
i/p: #### 3453 ==>3453
May I know what is missing in my regex to implement this requirement?
In the pattern that you tried, only matching whitespace chars is mandatory, and you might possibly also match only newlines.
Change the quantifiers to + to match 1+ times, and if you don't want to match newlines as well use [^\S\r\n]+ instead.
If that exact match is only allowed, add an anchor $ to assert the end of the string, or add \Z if there is no newline following allowed.
^\S+[^\S\r\n]+\d+$
^ Start of string
\S+ Match 1+ times a non whitespace char
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
\d+ Match 1+ digits
$ End of string
Regex demo

Regex include only one digit between chars

I have to parse a PDF document and I'm using PyPDF2 with re(regex).
The file includes several lines like the one below:
18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40
I need to extract from this line the text( bold ) between the time and the amount:
PEDMILANO OVEST- BINASCOA
The following code is working but sometimes this code doesn't find anything since can be a number between these chars, for example, 18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40.
regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
Is there a way to include a number in this regular expression?
The following should simplify the current regex:
import re
s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'
re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'
See demo
\d+([A-Z].*?)(?=\d+\,\d+$)
\: matches the character : literally (case sensitive)
\d+: matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([A-Z].*?)
Match a single character present in the list below [A-Z]
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\d+\,\d+$)
Assert that the Regex below matches
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
I suggest using
import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )
It can also be written as
re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)
Or, if you prefer matching and capturing:
m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
print( m.group(1) )
See an online Python demo. With this solution, your data may start with any char, and will contain any char (excluding line break chars, since your data is on single lines).
Regex details
^ - start of string
\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, -, two digits, -, five or six digits, :, two digits, : two digits
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\d+(?:,\d+)? - an int/float value pattern: 1+ digits followed with an optional sequence of , and 1+ digits
$ - end of string.
See the regex demo.

Categories

Resources