How to keep a captured group only in Python regex? - python

Please accept my apology if this is a dumb question.
I want to make a regex expression that can make the two following changes in Python.
$12345.67890 to 12345.67
$12345 to 12345
What would be an appropriate regex expression to make both changes?
Thank you in advance.

We can try using re.sub here:
inp = "Here is a value $12345.67890 for replacement."
out = re.sub(r'\$(\d+(?:\.\d{1,2})?)\d*\b', '\\1', inp)
print(out)
This prints:
Here is a value 12345.67 for replacement.
Here is an explanation of the regex pattern:
\$ match $
( capture what follows
\d+ match one or more whole number digits
(?:\.\d{1,2})? then match an optional decimal component, with up to 2 digits
) close capture group (the output number you want)
\d* consume any remaining decimal digits, to remove them
\b until hitting a word boundary

in notepad++ style i would do something like
find \$(\d+\.)(\d\d)\d+
Replace \1\2
hope that helps

Related

capture the number iwth comma or dot with regex

I have regex code
https://regex101.com/r/o5gdDt/8
As you see this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all digits which sperated by 3 digits in text like
"here is 100,100"
"23,456"
"1,435"
all more than 4 digit number like without comma separated
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue here is if there is a comma(dot) after the first two types it can not capture those. for example, it can not capture
"here is 100,100,"
"23,456,"
"1,435,"
unfortunately, there is a few number intext which ends with comma...can someone gives me an idea of how to modify this to capture above also?
I have tried to do it and modified version is so:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
basically I delete comma in (?![\d,]) but it causes to another problem in my context
it captures part of a number that is part of equation like this :
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know that is kind of special question I would be happy to know your ides
The main problem here is that (?![\d,]) fails any match followed with a digit or comma while you want to fail the match when it is followed with a digit or a comma plus a digit.
Replace (?![\d,]) with (?!,?\d).
Also, (?<!\S)(?<![\d,]) looks redundant, as (?<!\S) requires a whitespace or start of string and that is certainly not a digit or ,. Either use (?<!\S) or (?<!\d)(?<!\d,) depending on your requirements.
Join the negative lookaheads with OR: (?!x)(?!/) => (?!x|/) => (?![x/]).
You wnat to avoid matching years, but you just fail all numbers that start with them, so 2020222 won't get matched. Add (?!\d) to the lookahead, (?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d)).
So, the pattern might look like
(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])
See the regex demo.
IMPORTANT: You have [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) at the end, a negative lookahead after an optional pattern. Once the engine fails to find the match for x or /, it will backtrack and will most probably find a partial match. If you do not want to match 65,656 in 65,656½x, replace [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) with (?![\u00BC-\u00BE\u2150-\u215E]?[x/])[\u00BC-\u00BE\u2150-\u215E]?.
See another regex demo.

regex expression to get all digits before full stop

I wish to do as my title said but I cant seem to be able to do it.
string = "tex3591.45" #please be aware that my digit is in half-width
text_temp = re.findall("(\d.)", string)
My current output is:
['35', '91', '45']
My expected output is:
['3591.'] # with the "." at the end of the integer. No matter how many integer infront of this full stop
You need to escape the .:
text_temp = re.findall(r"\d+\.", string)
since . is a special character in regex, which matches any character. Added the + also to match 1 or more digits.
Or if you actually are using 'FULLWIDTH FULL STOP' (U+FF0E) you can just use the special character in the regex without escaping it:
text_temp = re.findall(r"\d+.", string)
You can use this regex along with re.findall to get your desired result
\d(?=.*?.)
will generate individual digits as answer
Demo in regex 101
\d+(?=.*?.)
Demo2
This will generate a bunch of numbers as one string
I used a positive lookahead and a greedy matching to check if there is a full stop after a certain digit and then give output. Hope this helps :).

[FORKING]Python Regex - Re.Sub and Re.Findall Interesting Challenges

Not sure if this is something that should be a bounty. II just want to understand regex better.
I checked the responses in the Regex to match pattern.one skip newlines and characters until pattern.two and Regex to match if given text is not found and match as little as possible threads and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
.
Problem 1:
Given Strings below with characters followed by / how would I produce a SINGLE regex (using only either re.sub or re.findall) that uses alternating capture groups which must use [\S]+/ to get the desired output
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?: # Optional group
(?! / ) # Not a forward slash ahead
[\S] # Not whitespace class
)* # End group, do 0 to many times
- # A dash must exist
(?: # Optional group, same as above
(?! / )
[\S]
)*
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.

python re.sub replace number in match string

Replacing numbers in a string after the match.
t = 'String_BA22_FR22_BC'
re.sub(r'FR(\d{1,})','',t)
My desired output would be String_BA22_FR_BC
You may use
re.sub(r'FR\d+','FR',t)
Alternatively, you may capture the part you need to keep with a capturing group and replace with \1 backreference:
re.sub(r'(FR)\d+', r'\1', t)
^--^- >>>----^
See the Python demo
A capturing group approach is flexible as it allows patterns of unlimited length.
You are replacing what you match (in this case FR22) with an empty string.
Another option is to use a positive lookbehind an then match 1+ digits adn replace that match with an empty string:
(?<=FR)\d+
Regex demo | Python demo
For example:
re.sub(r'(?<=FR)\d+','',t)

regular expression ending

I have ONE string as plain text and want to extract phone numbers of any format from it.
Here is my regex:
r = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)[-\s*]\d{3}[-\.\s]??\d{4})")
It extracts the following matches correctly:
617.933.6444
(880)-567-4565
(880) 567-4565
222-333-8888
555 666 4444
9999999999
But how can I avoid getting 7986815059 when I have 798681505951 in the text?
How to make an ending for my regex? (it should not contain letters and digits after and before, exact number count must be 10)
!!!!
Decision
If somebody needs to find US phone numbers in string, use link from the last Wiktor Stribiżew comment.
You need to use word boundaries, but placing them into your pattern is not obvious. It is due to the fact that the second alternative starts with a non-word char, \(. Thus, the first \b must be added at the beginning of the first alternative, and the trailing one at the very end of the pattern:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^ ^^
See the regex demo
You may also require a non-word char or start of string before (. Then add \B at the second alternative start:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^
See another demo
Also, note that there is no need escaping a . inside a character class, it is already parsed as a literal dot in [.]. And no need using a lazy ?? quantifier, it does not make sense here and a greedy version, ?, will work equally well and will look "cleaner".

Categories

Resources