I would like to replace [1-2] with 1, [3-4] with 3, [7-8] with 7, [2] with 2, and so on.
For example, I would like to use the following strings:
db[1-2].abc.xyz.pqr.abc.abc.com
db[3-4].abc.xyz.pqr.abc.abc.com
db[1].abc.xyz.pqr.abc.abc.com
xyz-db[1-2].abc.xyz.pqr.abc.abc.com
and convert them to
db1.abc.xyz.pqr.abc.abc.com
db3.abc.xyz.pqr.abc.abc.com
db1.abc.xyz.pqr.abc.abc.com
xyz-db1.abc.xyz.pqr.abc.abc.com
You could use a regex like:
^(.*)\[([0-9]+).*?\](.*)$
and replace it with:
$1$2$3
Here's what the regex does:
^ matches the beginning of the string
(.*) matches any character any amount of times, and is also the first capture group
\[ matches the character [ literally
([0-9]+) matches one or more digits, and is also the second capture group
.*? matches any character any amount of times, but tries to find the smallest match
\] matches the character ] literally
(.*) matches any character any amount of times, and is also the third capture group
$ matches the end of the string
By replacing it with $1$2$3, you are replacing it with the text in the first capture group, followed by the text in the second capture group, followed by the text in the third capture group.
Here's a live preview on regex101.com
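In Python, the same substitution can be sketched with re.sub. Note that Python writes backreferences in the replacement as \1\2\3 rather than $1$2$3. The names BRACKET_RANGE and collapse_range are illustrative choices, not part of the question.

```python
import re

# Same pattern as above, compiled once so it can be reused.
BRACKET_RANGE = re.compile(r"^(.*)\[([0-9]+).*?\](.*)$")

def collapse_range(host):
    """Replace a bracketed range such as [1-2] with its first number."""
    # Python spells the replacement groups \1, \2, \3 instead of $1, $2, $3.
    return BRACKET_RANGE.sub(r"\1\2\3", host)

hosts = [
    "db[1-2].abc.xyz.pqr.abc.abc.com",
    "db[3-4].abc.xyz.pqr.abc.abc.com",
    "db[1].abc.xyz.pqr.abc.abc.com",
    "xyz-db[1-2].abc.xyz.pqr.abc.abc.com",
]

for host in hosts:
    print(collapse_range(host))
```

A string without any bracket is returned unchanged, because re.sub leaves the input alone when the pattern does not match.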
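import re

def fixString(strToFix):
    # Greedy (.*) runs up to the last "[", (\d*) grabs the leading digits,
    # and .* skips the rest of the bracket contents up to the last "]".
    groups = re.match(r"(.*)\[(\d*).*\](.*)", strToFix).groups()
    return "%s%s%s" % (groups[0], groups[1], groups[2])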
I want to match any string that starts with . and a word, optionally followed by a space and any characters.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want the second group to be mandatory if the first group is jargon.
Using Python, you can use a negative lookahead to rule out the case where jargon is the only word, and then match 1 or more word characters.
After that, optionally match 1 or more whitespace characters (excluding newlines) followed by 1 or more characters other than newlines.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exclude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
See a regex demo and a Python demo.
Example
import re

strings = [
    ".just one two",
    ".just",
    ".blah one#nine",
    ".blah",
    ".jargon blah",
    ".jargon"
]

for s in strings:
    # re.match already anchors at the start of the string, so no leading ^ is needed
    m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
    if m:
        print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah
One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
import re

inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", ".jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']
You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
If successful, this will match the entire string. If one simply wants to know whether the string conforms to the requirements, .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the string of word characters is followed by a space followed by any character or they terminate the string.
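As a rough check, the pattern above can be run against the sample strings from the question. This is only a sketch; the names PATTERN, samples and matched are illustrative.

```python
import re

# Pattern from the answer above: reject a lone "jargon", then require that
# the first word is followed by a space plus a character, or by the end.
PATTERN = re.compile(r"^\.(?!jargon$)\w+(?= .|$).*")

samples = [
    ".just one two",
    ".just",
    ".blah one#nine",
    ".blah",
    ".jargon blah",
    ".jargon",
]

# Keep only the strings that the pattern accepts.
matched = [s for s in samples if PATTERN.match(s)]
print(matched)
```

Only the bare .jargon entry is filtered out; .jargon blah passes because the negative lookahead requires jargon to be immediately followed by the end of the string.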
String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0.
In string 2, I want to extract CAKE<4:0.
Any regex for this in Python?
You can use \w+<[^>]+
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] matches a single character other than >
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
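A minimal sketch of how this pattern could be used from Python with re.findall. The helper name extract_fields is an assumption made for illustration.

```python
import re

def extract_fields(line):
    """Return every NAME<bits chunk, stopping just before the closing >."""
    # \w+ is the name, < is literal, [^>]+ consumes everything up to (not including) >
    return re.findall(r"\w+<[^>]+", line)

string1 = "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97"
string2 = "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"

print(extract_fields(string1))
print(extract_fields(string2))
```

Because the pattern has no capture group, re.findall returns the whole match for each hit.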
We can use re.findall here with the pattern (\w+<.*?)>:
import re

inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
    matches = re.findall(r'(\w+<.*?)>', i)
    print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match <, then 1+ digits, a colon, and 1+ digits
(?=>) Positive lookahead, assert > directly to the right
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d+)>
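Both variants can be tried side by side in Python. This is a sketch only: the capture-group variant is written here with \d+ on both sides of the colon so that it accepts multi-digit ranges, and the names LOOKAROUND and CAPTURE are illustrative.

```python
import re

# Lookaround variant: the space/pipe and the > are asserted, not consumed.
LOOKAROUND = re.compile(r"(?<=[ |])[A-Z]+<\d+:\d+(?=>)")
# Capture-group variant: the space/pipe and the > are consumed,
# and only group 1 is returned by findall.
CAPTURE = re.compile(r"[ |]([A-Z]+<\d+:\d+)>")

string1 = "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97"
string2 = "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"

for line in (string1, string2):
    print(LOOKAROUND.findall(line), CAPTURE.findall(line))
```

On these two inputs both variants return the same lists; the practical difference is whether the surrounding characters become part of the overall match.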
How can I replace every standalone single digit with the same digit followed by a comma, using Python regex?
text = 'I am going at 5pm to type 3 and the 9 later'
I want this to be converted to
text = 'I am going at 5pm to type 3, and the 9, later'
My attempt:
match = re.search('\s\d{1}\s', x)
I was able to detect them, but I don't know how to replace each one with the same digit followed by a comma.
Regex #1
(?<=\b\d)\b
Replace with ,
How it works:
(?<=\b\d) positive lookbehind ensuring the following precedes:
\b assert position as a word boundary
\d match a digit
\b assert position as a word boundary
To prevent it from matching at locations like "3, a" (a digit already followed by a comma), simply append (?!,) to the regex.
Regex #2
To prevent matching a single digit at the start and end of the string, you can use the following regex:
(?<=(?<!^)\b\d)\b(?!$)
Same as above regex, but adds following:
(?<!^) ensures the word boundary \b that it precedes doesn't match the start of the line
(?!$) ensures the word boundary \b that it follows doesn't match the end of the line
You can remove either token if that's not the behaviour you want.
To prevent it from matching at locations like "3, a" (a digit already followed by a comma), simply change the negative lookahead to (?!,|$) or append (?!,) to the regex.
Regex #3
If \b can't be used (e.g. if you have some numbers like 3.3), you can use the following instead:
(?:(?<=\s\d)|(?<=^\d))(?=\s)
How it works:
(?:(?<=\s\d)|(?<=^\d)) match either of the following:
(?<=\s\d) positive lookbehind ensuring what precedes is a whitespace character followed by a digit
(?<=^\d) positive lookbehind ensuring what precedes is the start of the line followed by a digit
(?=\s) positive lookahead ensuring what follows is a whitespace character
Regex #4
If you don't need to match digits at the start of the string, modify the previous regex (Regex #3) by removing the second lookbehind as such:
(?<=\s\d)(?=\s)
Code
Sample code (replace regex pattern with whichever pattern works best for you):
import re
x = 'I am going at 5pm to type 3 and the 9 later'
r = re.sub(r'(?<=\b\d)\b', ',', x)
print(r)
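To see how the variants differ, here is a small comparison sketch. The input string is a hypothetical example chosen to include a digit at the start, a decimal number, and a digit at the end; it is not from the question.

```python
import re

# Patterns copied from Regex #1, #3 and #4 above.
patterns = {
    "regex 1": r"(?<=\b\d)\b",
    "regex 3": r"(?:(?<=\s\d)|(?<=^\d))(?=\s)",
    "regex 4": r"(?<=\s\d)(?=\s)",
}

sample = "3 is 3.3 and 7"  # hypothetical input for illustration

# Each pattern matches an empty position, so sub() inserts a comma there.
results = {name: re.sub(pattern, ",", sample) for name, pattern in patterns.items()}

for name, out in results.items():
    print(name, "->", out)
```

With this input, the \b-based pattern also splits the decimal 3.3 and marks the trailing 7, while the whitespace-based patterns leave both alone; Regex #4 additionally skips the leading 3.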
You could use a word boundary and a capture group to achieve this:
import re
text = 'I am going at 5pm to type 3 and the 9 later'
re.sub(r'\b(\d)\b', r"\1,", text)
# => 'I am going at 5pm to type 3, and the 9, later'
I have a quick question on regex. I have a certain string to match, shown below:
"[someword] This Is My Name 2010"
or
"This Is My Name 2010"
or
"(someword) This Is My Name 2010"
Basically if given any of the strings above, I want to only keep "This Is My Name" and "2010".
Here is what I have now, which I use with result = re.search(...) and then result.group() to get the answer:
'[\]\)]? (.+) ([0-9]{4})\D'
Basically it works with the first and third case, by allowing me to optionally match the end bracket, have a space character, and then match "This Is My Name".
However, with the second case, it only matches "Is My Name". I think this is because of the space between the '?' and '(.+)'.
Is there a way to deal with this issue in pure regex?
One way I can think of is to add an "if" statement to determine if the word starts with a [ or ( before using the appropriate regex.
The pattern that you tried [\]\)]? (.+) ([0-9]{4})\D optionally matches a closing square bracket or parenthesis. Adding the \D at the end, it expects to match any character that is not a digit.
You can optionally match the whole (...) or [...] part before the first capturing group, as [\]\)]? only matches an optional closing bracket.
Then you can capture all that follows in group 1, followed by matching the last 4 digits in group 2 and add a word boundary.
(?:\([^()\n]*\) |\[[^][\n]*\] )?(.+) ([0-9]{4})\b
(?: Non capture group
\([^()\n]*\)  Match (...) and a space
| Or
\[[^][\n]*\]  Match [...] and a space
)? Close group and make it optional
(.+) Capture group 1, match 1+ times any char except a newline, followed by a space
([0-9]{4})\b Capture group 2, match 4 digits followed by a word boundary
Note that .+ will match until the end of the line and then backtrack until the last occurrence of 4 digits. If that should be the first occurrence, you could make it non-greedy with .+?
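A possible way to apply this pattern from Python is sketched below. The helper name split_title_year is an assumption for illustration, and the square brackets inside the negated character class are escaped here, which is an equivalent spelling of [^][\n].

```python
import re

# Optional "(...) " or "[...] " prefix, then the title in group 1
# and a 4-digit year in group 2.
PATTERN = re.compile(r"(?:\([^()\n]*\) |\[[^\]\[\n]*\] )?(.+) ([0-9]{4})\b")

def split_title_year(text):
    """Return (title, year) or None when the text does not fit the pattern."""
    m = PATTERN.match(text)  # match() anchors at the start of the string
    return m.groups() if m else None

for s in ("[someword] This Is My Name 2010",
          "This Is My Name 2010",
          "(someword) This Is My Name 2010"):
    print(split_title_year(s))
```

Using re.match rather than re.search keeps the optional prefix tied to the start of the string, so the bracketed word cannot end up inside group 1.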
You can use re.sub to replace the first portion of the sentence if it starts with (square or round) brackets, with an empty string. No if statement is needed:
import re
s1 = "[someword] This Is My Name 2010"
s2 = "This Is My Name 2010"
s3 = "(someword) This Is My Name 2010"
reg = r'\[.*?\] |\(.*?\) '  # raw string so the backslashes reach the regex engine unchanged
res1 = re.sub(reg, '', s1)
print(res1)
res2 = re.sub(reg, '', s2)
print(res2)
res3 = re.sub(reg, '', s3)
print(res3)
OUTPUT
This Is My Name 2010
This Is My Name 2010
This Is My Name 2010
I am using python regex to read documents.
I have the following line in many documents:
Dated: February 4, 2011 THE REAL COMPANY, INC
I can use python text search to easily find the lines that have "dated," but I want to pull THE REAL COMPANY, INC from the text without getting the "February 4, 2011" text.
I have tried the following:
[A-Z\s]{3,}.*INC
My understanding of this regex is that it should get me all capital letters and spaces before INC, but instead it pulls the full line.
This suggests to me I'm fundamentally missing something about how regex works with capital letters. Is there an easy and obvious explanation I'm missing?
What about using:
>>> import re
>>> txt
'Dated: February 4, 2011 THE REAL COMPANY, INC'
>>> re.findall('([A-Z][A-Z]+)', txt)
['THE', 'REAL', 'COMPANY', 'INC']
Another way, as suggested by @davedwards, is:
>>> re.findall(r'[A-Z\s]{3,}.*', txt)
[' THE REAL COMPANY, INC']
Explanation:
[A-Z\s]{3,}.*
Match a single character present in the list below [A-Z\s]{3,}
{3,} Quantifier — Matches between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
You could use
^Dated:.*?\s([A-Z ,]{3,})
And make use of the first capturing group, see a demo on regex101.com.
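A minimal sketch of reading the first capturing group in Python. The function name company_from_line is illustrative only.

```python
import re

# Lazy .*? skips ahead until a whitespace character is followed by
# at least three characters drawn from uppercase letters, spaces and commas.
PATTERN = re.compile(r"^Dated:.*?\s([A-Z ,]{3,})")

def company_from_line(line):
    """Return the all-caps trailing name, or None if the line does not match."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

print(company_from_line("Dated: February 4, 2011 THE REAL COMPANY, INC"))
```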
Your regex [A-Z\s]{3,}.*INC matches an uppercase character or a whitespace character 3 or more times, followed by any character 0+ times, and then INC. On this line that matches THE REAL COMPANY, INC (including the space in front of THE).
What you could also do is match Dated: from the start of the string followed by a date like format and then capture what comes after in a group. Your value will be in the first capturing group:
^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$
Explanation
^Dated:\s+ Match Dated: followed by 1+ times a whitespace character
\S+\s+ Match 1+ times not a whitespace character followed by 1+ times a whitespace character, which will match February in this case
\d{1,2}, Match 1-2 times a digit followed by a comma
\s+\d{4}\s+ match 1+ times a whitespace character, 4 digits, followed by 1+ times a whitespace character
(.*) Capture in a group 0+ times any character
$ Assert the end of the string
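The date-shaped pattern could be used from Python roughly as follows. This is a sketch; text_after_date is a hypothetical helper name.

```python
import re

# "Dated:", a month word, a 1-2 digit day with a comma, a 4-digit year,
# then everything that remains on the line in group 1.
PATTERN = re.compile(r"^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$")

def text_after_date(line):
    """Return whatever follows the date, or None if the line has another shape."""
    m = PATTERN.match(line)
    return m.group(1) if m else None

print(text_after_date("Dated: February 4, 2011 THE REAL COMPANY, INC"))
```

Compared with matching on capital letters, this variant does not care about the casing of the company name, but it does assume the date always has the month-day-comma-year layout.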