How to replace single digit by same digit followed by punctuation? - python

I want to replace any single digit with the same digit followed by a comma, using Python regex.
text = 'I am going at 5pm to type 3 and the 9 later'
I want this to be converted to
text = 'I am going at 5pm to type 3, and the 9, later'
My attempt:
match = re.search(r'\s\d{1}\s', text)
I am able to detect the digits, but I don't know how to replace each one with the same digit followed by a comma.

Regex #1
(?<=\b\d)\b
Replace with ,
How it works:
(?<=\b\d) positive lookbehind ensuring the following precedes:
\b assert position as a word boundary
\d match a digit
\b assert position as a word boundary
To prevent it from matching at locations like "3, a" (a digit already followed by a comma), simply append (?!,) to the regex.
Regex #2
To prevent matching a single digit at the start and end of the string, you can use the following regex:
(?<=(?<!^)\b\d)\b(?!$)
Same as above regex, but adds following:
(?<!^) ensures the word boundary \b that it precedes doesn't match the start of the line
(?!$) ensure the word boundary \b that it follows doesn't match the end of the line
You can remove either token if that's not the behaviour you want.
To prevent it from matching at locations like "3, a" (a digit already followed by a comma), simply change the negative lookahead to (?!,|$) or append (?!,) to the regex.
Regex #3
If \b can't be used (e.g. if you have some numbers like 3.3), you can use the following instead:
(?:(?<=\s\d)|(?<=^\d))(?=\s)
How it works:
(?:(?<=\s\d)|(?<=^\d)) match either of the following:
(?<=\s\d) positive lookbehind ensuring what precedes is a whitespace character followed by a digit
(?<=^\d) positive lookbehind ensuring what precedes is the start of the line followed by a digit
(?=\s) positive lookahead ensuring what follows is a whitespace character
Regex #4
If you don't need to match digits at the start of the string, modify the previous regex (#3) by removing the second lookbehind, like so:
(?<=\s\d)(?=\s)
Code
Sample code (replace regex pattern with whichever pattern works best for you):
import re
x = 'I am going at 5pm to type 3 and the 9 later'
r = re.sub(r'(?<=\b\d)\b', ',', x)
print(r)
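To see how the four patterns differ, the sketch below runs each of them through re.sub on one string. The sample string is made up for illustration (it is not from the question) and was chosen to contain a digit at the start, a digit at the end, and a decimal number.

```python
import re

# The four patterns from the answer above, keyed by their heading.
patterns = {
    "Regex #1": r'(?<=\b\d)\b',
    "Regex #2": r'(?<=(?<!^)\b\d)\b(?!$)',
    "Regex #3": r'(?:(?<=\s\d)|(?<=^\d))(?=\s)',
    "Regex #4": r'(?<=\s\d)(?=\s)',
}

# Hypothetical sample: single digits at both ends plus a decimal number.
sample = '7 apples cost 3.3 euros and 2 pears cost 4'

# Every pattern matches an empty position, so re.sub inserts the comma there.
results = {name: re.sub(pattern, ',', sample) for name, pattern in patterns.items()}
for name, result in results.items():
    print(f'{name}: {result}')
# Regex #1: 7, apples cost 3,.3, euros and 2, pears cost 4,
# Regex #2: 7 apples cost 3,.3, euros and 2, pears cost 4
# Regex #3: 7, apples cost 3.3 euros and 2, pears cost 4
# Regex #4: 7 apples cost 3.3 euros and 2, pears cost 4
```

Only the whitespace-based patterns (#3 and #4) leave 3.3 untouched, which is why they are the safer choice when the text may contain decimals.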

You could use a word boundary and a capture group to achieve this:
import re
text = 'I am going at 5pm to type 3 and the 9 later'
re.sub(r'\b(\d)\b', r"\1,", text)
# => 'I am going at 5pm to type 3, and the 9, later'

Related

Exclude words with pattern 'xyyx' but include words that start & ends with same letter

I have a regex to match words that starts and ends with the same letter (excluding single characters like 'a', '1' )
(^.).*\1$
and another regex to avoid matching any strings with the format 'xyyx' (e.g 'otto', 'trillion', 'xxxx', '-[[-', 'fitting')
^(?!.*(.)(.)\2\1)
How do I construct a single regex to meet both of the requirements?
You can start the pattern with the negative lookahead followed by the pattern for the match. But note to change the backreference to \3 for the last pattern as the lookahead already uses group 1 and group 2.
Note that the . also matches a space, so if you don't want to match spaces you can use \S to match non whitespace chars instead.
^(?!.*(.)(.)\2\1)(.).*\3$
I would place the negative look-ahead after the initial character, and let it exclude the final character (as those two should be part of a positive capture):
^(.)(?!.*(.)\2.).*\1$
Note that the negative check concerns characters between the start and ending character, and so these words would not be rejected:
oopso
livewell
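A quick way to check both combined patterns is to run them over a small word list in Python. This is only a sketch: the word list mixes the examples from the question with a few extra words ('level', 'hello', 'a') added here for illustration.

```python
import re

# Combined patterns from the two answers above.
lookahead_first = re.compile(r'^(?!.*(.)(.)\2\1)(.).*\3$')
lookahead_after_first_char = re.compile(r'^(.)(?!.*(.)\2.).*\1$')

# Question examples plus a few illustrative extras.
words = ['otto', 'trillion', 'xxxx', 'fitting', 'level', 'a', 'hello', 'oopso', 'livewell']

kept_first = [w for w in words if lookahead_first.match(w)]
kept_second = [w for w in words if lookahead_after_first_char.match(w)]

print(kept_first)   # ['level', 'oopso', 'livewell']
print(kept_second)  # ['level', 'oopso', 'livewell']
```

On this list both patterns keep the same words. Single characters such as 'a' are excluded because the backreference needs a second occurrence of the first character.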

How to extract string between space and symbol '>'?

String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
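As a minimal sketch, the pattern above can be applied with re.findall to the two strings from the question. The variable names are chosen here for illustration.

```python
import re

string_1 = '[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97'
string_2 = '[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17'

# One or more word characters, a literal '<', then everything up to (not including) '>'.
pattern = re.compile(r'\w+<[^>]+')

found_1 = pattern.findall(string_1)
found_2 = pattern.findall(string_2)
print(found_1)  # ['CAR<7:5', 'BIKE<4:0']
print(found_2)  # ['CAKE<4:0']
```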
We can use re.findall here with the pattern (\w+<.*?)>:
import re
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
    matches = re.findall(r'(\w+<.*?)>', i)
    print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match < followed by 1+ digits, a colon, and 1+ digits
(?=>) Positive lookahead, assert > directly to the right
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d+)>
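Both variants can be tried in Python as sketched below. The capture-group pattern is written with \d+ for the second number so that multi-digit values are matched as well; with re.findall, a pattern containing one capture group returns only that group.

```python
import re

lines = [
    '[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97',
    '[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17',
]

# Lookaround variant: the space/pipe and the '>' are asserted, not consumed.
lookaround = re.compile(r'(?<=[ |])[A-Z]+<\d+:\d+(?=>)')
# Capture-group variant: the delimiters are consumed, group 1 holds the value.
captured = re.compile(r'(?:[ |])([A-Z]+<\d+:\d+)>')

lookaround_results = [lookaround.findall(line) for line in lines]
captured_results = [captured.findall(line) for line in lines]
print(lookaround_results)  # [['CAR<7:5', 'BIKE<4:0'], ['CAKE<4:0']]
print(captured_results)    # [['CAR<7:5', 'BIKE<4:0'], ['CAKE<4:0']]
```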

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = 'The special code is 034567 in this particular case and not 98675'
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your pattern requires two whitespace characters after code (a literal space followed by \s), then \w. matches exactly one word character plus any one character other than a line break, and then \s followed by a literal space again requires two whitespace characters. The text has only single spaces between words, so nothing matches.
Instead, match one or more whitespace characters between words with \s+, and use \S+ rather than \w. to match any chunk of non-whitespace characters.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
Use re.findall with a capture group:
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
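re.findall returns only the captured text, not its position. Since the question also asks for the start and end index, one option is re.finditer, which yields match objects. The following is a sketch reusing the same pattern; the variable names are illustrative.

```python
import re

text = 'The special code is 034567 in this particular case and not 98675'
pattern = re.compile(r'\bspecial code (?:\S+\s+)?(\d+)')

# For each match, keep the captured digits plus the start and end index of group 1.
found = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]
print(found)  # [('034567', 20, 26)]
```

The end index is exclusive, so text[20:26] gives back '034567'.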

regex select sequences that start with specific number

I want to select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a whitespace character preceding the target string. The match also includes that whitespace if present, so I would suggest either trimming the matched string or wrapping the latter part in parentheses to make it a group and then taking group 1 from the matches:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
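In Python, re.findall returns the contents of the group when the pattern has exactly one capture group, so the grouped pattern gives the wanted list directly. The sketch below also shows a lookbehind alternative, (?<!\S), which is not from the answers above and is included only as an assumption-free way to say "not preceded by a non-whitespace character".

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# Grouped pattern from the answer: findall returns only group 1.
grouped = re.findall(r'(?:^|\s)(0\S*)', x)

# Alternative (not from the answers): assert there is no non-whitespace to the left.
lookbehind = re.findall(r'(?<!\S)0\S*', x)

# Non-regex approach from the first answer, for comparison.
split_based = [token for token in x.split() if token.startswith('0')]

print(grouped)      # ['0,1,1,755,300', '0,1,1,755,50']
print(lookbehind)   # ['0,1,1,755,300', '0,1,1,755,50']
print(split_based)  # ['0,1,1,755,300', '0,1,1,755,50']
```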

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the m and u flags for multiline and Unicode (multiline is only necessary if the input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
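A minimal Python sketch of this pattern follows. It assumes Python 3, where str patterns are Unicode-aware by default, so only re.MULTILINE needs to be passed explicitly; the variable names are illustrative.

```python
import re

# re.MULTILINE makes ^ and $ work per line; \w already matches Unicode letters in Python 3.
pattern = re.compile(r'^(?:(?!\d)[\w ])+$', re.MULTILINE)

# The two input lines from the Results section, separated by a newline.
text = 'ÀÇÆ some words\nÀÇÆ some words 123'
matched_lines = pattern.findall(text)
print(matched_lines)  # ['ÀÇÆ some words']

# The examples from the question, checked one string at a time.
candidates = ['hello 123', 'hell o']
accepted = [s for s in candidates if pattern.match(s)]
print(accepted)  # ['hell o']
```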
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']
