I have a quick question about regex. I need to match strings like the ones shown below:
"[someword] This Is My Name 2010"
or
"This Is My Name 2010"
or
"(someword) This Is My Name 2010"
Basically if given any of the strings above, I want to only keep "This Is My Name" and "2010".
This is the pattern I have now; I call result = re.search(...) and then result.group() to get the answer:
'[\]\)]? (.+) ([0-9]{4})\D'
Basically it works with the first and third case, by allowing me to optionally match the end bracket, have a space character, and then match "This Is My Name".
However, with the second case, it only matches "Is My Name". I think this is because of the space between the '?' and '(.+)'.
Is there a way to deal with this issue in pure regex?
One way I can think of is to add an "if" statement to determine if the word starts with a [ or ( before using the appropriate regex.
The pattern that you tried, [\]\)]? (.+) ([0-9]{4})\D, optionally matches only a closing square bracket or parenthesis. The \D at the end must match a character that is not a digit, so the match fails if the year is the last thing in the string.
Instead, you can optionally match the whole (...) or [...] part before the first capturing group, since [\]\)] only matches the closing bracket.
Then you can capture all that follows in group 1, match the last 4 digits in group 2, and end with a word boundary.
(?:\([^()\n]*\) |\[[^][\n]*\] )?(.+) ([0-9]{4})\b
(?: Non-capturing group
\([^()\n]*\) Match (...) followed by a space
| Or
\[[^][\n]*\] Match [...] followed by a space
)? Close the group and make it optional
(.+) Capture group 1, match 1+ times any char except a newline, followed by a space
([0-9]{4})\b Capture group 2, match 4 digits followed by a word boundary
Note that .+ is greedy: it first matches until the end of the line and then backtracks to the last occurrence of a space followed by 4 digits. If it should stop at the first occurrence instead, you could make it non-greedy with .+?
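As a quick check, here is a minimal sketch that applies the pattern above to the three strings from the question. The helper name extract and the list names are made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer: optional "(...) " or "[...] " prefix,
# then the name (group 1) and a 4-digit year (group 2).
pattern = re.compile(r"(?:\([^()\n]*\) |\[[^][\n]*\] )?(.+) ([0-9]{4})\b")

samples = [
    "[someword] This Is My Name 2010",
    "This Is My Name 2010",
    "(someword) This Is My Name 2010",
]

def extract(text):
    """Return (name, year) or None when the pattern does not match."""
    m = pattern.match(text)
    return (m.group(1), m.group(2)) if m else None

results = [extract(s) for s in samples]
for r in results:
    print(r)
```

Using pattern.match anchors the attempt at the start of the string, which is what lets the optional prefix group absorb the bracketed word instead of leaving it in group 1.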
You can use re.sub to replace the first portion of the sentence if it starts with (square or round) brackets, with an empty string. No if statement is needed:
import re
s1 = "[someword] This Is My Name 2010"
s2 = "This Is My Name 2010"
s3 = "(someword) This Is My Name 2010"
reg = r'\[.*?\] |\(.*?\) '
res1 = re.sub(reg, '', s1)
print(res1)
res2 = re.sub(reg, '', s2)
print(res2)
res3 = re.sub(reg, '', s3)
print(res3)
OUTPUT
This Is My Name 2010
This Is My Name 2010
This Is My Name 2010
the co[njuring](media_title)
I want a regex to detect whether a pattern like the one above exists.
Currently I have two regexes that turn
line = "Can I please eat at[ warunk upnormal](restaurant_name)"
line = re.sub(r'\[\s*(.*?)\s*\]', r'[\1]', line)
line = re.sub(r'(\w)\[', r'\1 [', line)
into
Can I please eat at [warunk upnormal](restaurant_name)
Notice that the spaces inside the brackets are removed, which is good, and that a space is inserted between a word character and the opening bracket, e.g. x[ becomes x [.
What I want is to change the above two regexes so that they do not make the change for sentences like these:
the co[njuring](media_title)
the co[njuring](media_title) and che[ese dog]s(food)
Notice how the bracket appears in the middle of a word there. Basically, I want to know how I can improve these regexes to take this into account.
line = re.sub(r'\[\s*(.*?)\s*\]', r'[\1]', line)
line = re.sub(r'(\w)\[', r'\1 [', line)
For the 2 patterns that you use, you could also use a single pattern with 2 capturing groups.
(\w)\[\s*(.*?)\s*\]
In the replacement use the 2 capturing groups \1 [\2]
Example code
line = re.sub(r'(\w)\[\s*(.*?)\s*\]', r'\1 [\2]', line)
The difference I see in the given formats is that the value in parentheses contains an underscore in (restaurant_name) and (media_title), but not in (food).
If that is the case, you can use a third capturing group that matches the value in parentheses only when it contains at least one underscore, not at the start and not at the end.
(\w)\[\s*(.*?)\s*\](\([^_\s()]+(?:_[^_\s()]+)+\))
Explanation
(\w) Capture group 1, match a word char
\[\s* Match [ and 0+ whitespace chars
(.*?) Capture group 2, match any char except a newline non greedy
\s*\] Match 0+ whitespace chars and ]
( Capture group 3
\( Match (
[^_\s()]+ Match 1+ times any char except an underscore, whitespace char or parenthesis
(?:_[^_\s()]+)+ Repeat 1+ times the previous pattern with an underscore prepended
\) Match )
) Close group
In the replacement use the 3 capturing groups \1 [\2]\3
Example code
import re
regex = r"(\w)\[\s*(.*?)\s*\](\([^_\s()]+(?:_[^_\s()]+)+\))"
test_str = ("Can I please eat at[ warunk upnormal](restaurant_name)\n"
"Can I please eat at[ warunk upnormal ](restaurant_name)\n"
"the co[njuring](media_title)\n"
"the co[njuring](media_title) and che[ese dog]s(food)")
result = re.sub(regex, r"\1 [\2]\3", test_str)
if result:
    print(result)
Output
Can I please eat at [warunk upnormal](restaurant_name)
Can I please eat at [warunk upnormal](restaurant_name)
the co [njuring](media_title)
the co [njuring](media_title) and che[ese dog]s(food)
I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = "The special code is 034567 in this particular case and not 98675"
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your pattern requires two whitespace characters after code: a literal space and then \s (any whitespace). Then \w. matches exactly one word character followed by any one character other than a line break, and \s followed by a space again requires two whitespace characters. The text has only single spaces between the words, so nothing matches.
You can simply match one or more whitespace characters between words with \s+, and match any chunk of non-whitespace characters with \S+ instead of \w..
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1))  # 034567
    print(m.span(1))   # (20, 26)
Use re.findall with a capture group:
import re
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
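re.findall returns only the captured text, while the question also asks for the start and end index. A minimal sketch of one way to get both with the same pattern is to use finditer, which yields match objects; the variable names here are illustrative.

```python
import re

text = "The special code is 034567 in this particular case and not 98675"

# Same pattern as the findall example; each match object exposes the
# text of group 1 together with its start and end offsets.
pattern = re.compile(r"\bspecial code (?:\S+\s+)?(\d+)")

found = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]
print(found)  # [('034567', 20, 26)]
```

The end index follows Python's slicing convention, so text[20:26] gives back the number.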
I have the following string that I need to split into smaller ones in a correct way:
s = "A=3, B=value one, value two, value three, C=NA, D=Other institution, except insurance, id=DRT_12345"
I cannot do the following, since I need to split only on the last "," before the "="
s.split(",")
My desired outcome is the following:
out = ["A=3",
"B=value one, value two, value three",
"C=NA",
"D=Other institution, except insurance",
"id=DRT_12345"]
Following the structure of your string, you can use re.findall:
import re
re.findall(r'\S+=.*?(?=, \S+=|$)', s)
['A=3',
'B=value one, value two, value three',
'C=NA',
'D=Other institution, except insurance',
'id=DRT_12345']
The pattern uses a lookahead to determine when to stop matching for the current key-value pair.
\S+      # match one or more non-whitespace characters
=        # ...followed by an equals sign
.*?      # match anything up to...
(?=      # a lookahead for
, \S+=   # a comma and a space, followed by the same key pattern and "="
|        # OR
$        # EOL
)
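The commented breakdown above can also be written directly in Python with the re.VERBOSE flag. This is a sketch of the same pattern, not a different technique; the only change is that the literal space is written as [ ] because verbose mode ignores unescaped whitespace. The name pair is made up for illustration.

```python
import re

s = ("A=3, B=value one, value two, value three, C=NA, "
     "D=Other institution, except insurance, id=DRT_12345")

pair = re.compile(r"""
    \S+=          # key: one or more non-whitespace characters, then "="
    .*?           # value: as few characters as possible, up to...
    (?=           # ...a lookahead for either
        ,[ ]\S+=  #   a comma, a space and the next key with its "="
        |         #   or
        $         #   the end of the string
    )
""", re.VERBOSE)

out = pair.findall(s)
print(out)
```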
Split on "the last comma before the equals" can be translated to a regex like this:
import re
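out = re.split(r',\s*(?=[^,]*=)', s)
That's a comma (,) and any whitespace after it (\s*), followed by a positive lookahead ((?= ... )) for any number of non-comma characters ([^,]*) and then an equals sign (=). Consuming the whitespace keeps the pieces free of leading spaces.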
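Once the string is split this way, each piece can be divided at its first equals sign. The sketch below assumes whitespace after each splitting comma should be dropped (hence the \s* in the pattern) and that keys never contain "="; the name record is hypothetical.

```python
import re

s = ("A=3, B=value one, value two, value three, C=NA, "
     "D=Other institution, except insurance, id=DRT_12345")

# Split on a comma (plus trailing whitespace) only when the text up to
# the next comma contains an equals sign, i.e. a new key starts there.
pieces = re.split(r",\s*(?=[^,]*=)", s)

# Divide each "key=value" piece at the first "=" only.
record = dict(piece.split("=", 1) for piece in pieces)
print(pieces)
print(record["B"])
```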
From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
You can use
^(\d+)\s[^(]+\(([^()]+)\)
Here, [^(]+ restricts the match to characters other than (, so it cannot grab the whole line up to the end and instead stops at the first pair of parentheses.
Pattern explanation:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\) - a closing ).
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
    print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated character class, \(([^)]*)\), to match anything between ( and ):
>>> import re
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print(m.group(1))
70849
>>> print(m.group(2))
linux;u;android4.2.1;zh-cn
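To see the two behaviours side by side, this small sketch runs the pattern from the question and the negated-character-class pattern on the same string. The variable names greedy and negated are illustrative only.

```python
import re

s = ("70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30"
     "(khtml,likegecko)version/4.0mobilesafari/534.30")

# Greedy .+ runs to the end of the string and backtracks, so it settles
# on the LAST pair of parentheses.
greedy = re.search(r"(\d+)\s.+\((\S+)\)", s).group(2)

# [^(]* cannot cross an opening parenthesis, so the match has to use
# the FIRST pair of parentheses.
negated = re.search(r"(\d+)[^(]*\(([^)]*)\)", s).group(2)

print(greedy)
print(negated)
```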
I would like to replace [1-2] with 1, [3-4] with 3, [7-8] with 7, [2] with 2, and so on.
For example, I would like to use the following strings:
db[1-2].abc.xyz.pqr.abc.abc.com
db[3-4].abc.xyz.pqr.abc.abc.com
db[1].abc.xyz.pqr.abc.abc.com
xyz-db[1-2].abc.xyz.pqr.abc.abc.com
and convert them to
db1.abc.xyz.pqr.abc.abc.com
db3.abc.xyz.pqr.abc.abc.com
db1.abc.xyz.pqr.abc.abc.com
xyz-db1.abc.xyz.pqr.abc.abc.com
You could use a regex like:
^(.*)\[([0-9]+).*?\](.*)$
and replace it with:
$1$2$3
Here's what the regex does:
^ matches the beginning of the string
(.*) matches any character any amount of times, and is also the first capture group
\[ matches the character [ literally
([0-9]+) matches one or more digits, and is also the second capture group
.*? matches any character any amount of times, but tries to find the smallest match
\] matches the character ] literally
(.*) matches any character any amount of times, and is also the third capture group
$ matches the end of the string
By replacing it with $1$2$3, you are replacing it with the text in the first capture group, followed by the text in the second capture group, followed by the text in the third capture group.
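The $1$2$3 replacement syntax belongs to other regex flavours. Assuming the target language is Python, the same substitution can be sketched with re.sub, where group references are written \1, \2 and \3; the list names are made up for illustration.

```python
import re

hosts = [
    "db[1-2].abc.xyz.pqr.abc.abc.com",
    "db[3-4].abc.xyz.pqr.abc.abc.com",
    "db[1].abc.xyz.pqr.abc.abc.com",
    "xyz-db[1-2].abc.xyz.pqr.abc.abc.com",
]

# Same regex as above; the replacement keeps the text before "[",
# the first number inside the brackets, and the text after "]".
pattern = re.compile(r"^(.*)\[([0-9]+).*?\](.*)$")

fixed = [pattern.sub(r"\1\2\3", h) for h in hosts]
for name in fixed:
    print(name)
```

A string without a bracketed number simply does not match, and sub returns it unchanged.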
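import re
def fixString(strToFix):
    # Keep the text before "[", the first number inside the brackets,
    # and the text after "]".
    match = re.match(r"(.*)\[(\d+).*\](.*)", strToFix)
    if match is None:
        return strToFix  # no bracketed number: return the input unchanged
    return "%s%s%s" % match.groups()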