Where is such a regex wrong? - python

I am using python.
The pattern is:
re.compile(r'^(.+?)-?.*?\(.+?\)')
The text like:
text1 = 'TVTP-S2(xxxx123123)'
text2 = 'TVTP(xxxx123123)'
I expect to get TVTP

Another option to match those formats is:
^([^-()]+)(?:-[^()]*)?\([^()]*\)
Explanation
^ Start of string
([^-()]+) Capture group 1, match 1+ times any character other than - ( and )
(?:-[^()]*)? As the - is excluded from the first part, optionally match - followed by any char other than ( and )
\([^()]*\) Match from ( till ) without matching any parenthesis between them
Regex demo | Python demo
Example
import re
regex = r"^([^-()]+)(?:-[^()]*)?\([^()]*\)"
s = ("TVTP-S2(xxxx123123)\n"
"TVTP(xxxx123123)\n")
print(re.findall(regex, s, re.MULTILINE))
Output
['TVTP', 'TVTP']

This regex works:
pattern = r'^([^-]+).*\(.+?\)'
>>> re.findall(pattern, 'TVTP-S2(xxxx123123)')
['TVTP']
>>> re.findall(pattern, 'TVTP(xxxx123123)')
['TVTP']

a quick answer will be
^(\w+)(-.*?)?\((.*?)\)$
https://regex101.com/r/wL4jKe/2/

It is because the first plus is lazy, and the subsequent dash is optional, followed by a pattern that allows any character.
This allows the regex engine to choose the single letter T for the first group (because it is lazy), choose to interpret the dash as just not being there, which is allowed because it is followed by a question mark, and then have the next .* match "VTP-S2".
You can just grab non-dashes to capture, followed by nonparentheses up to the parentheses.
p=re.compile(r'^([^-]*?)[^(]*\(.+?\)')
p.search('TVTP-S2(xxxx123123) blah()').group(1)
The nonparentheses part prevents the second portion from matching 'S2(xxxx123123) blah(' in my modified example above.

Related

Regex string between square brackets only if '.' is within string

I'm trying to detect the text between two square brackets in Python however I only want the result where there is a "." within it.
I currently have [(.*?] as my regex, using the following example:
String To Search:
CASE[Data Source].[Week] = 'THIS WEEK'
Result:
Data Source, Week
However I need the whole string as [Data Source].[Week], (square brackets included, only if there is a '.' in the middle of the string). There could also be multiple instances where it matches.
You might write a pattern matching [...] and then repeat 1 or more times a . and again [...]
\[[^][]*](?:\.\[[^][]*])+
Explanation
\[[^][]*] Match from [...] using a negated character class
(?: Non capture group to repeat as a whole part
\.\[[^][]*] Match a dot and again [...]
)+ Close the non capture group and repeat 1+ times
See a regex demo.
To get multiple matches, you can use re.findall
import re
pattern = r"\[[^][]*](?:\.\[[^][]*])+"
s = ("CASE[Data Source].[Week] = 'THIS WEEK'\n"
"CASE[Data Source].[Week] = 'THIS WEEK'")
print(re.findall(pattern, s))
Output
['[Data Source].[Week]', '[Data Source].[Week]']
If you also want the values of between square brackets when there is not dot, you can use an alternation with lookaround assertions:
\[[^][]*](?:\.\[[^][]*])+|(?<=\[)[^][]*(?=])
Explanation
\[[^][]*](?:\.\[[^][]*])+ The same as the previous pattern
| Or
(?<=\[)[^][]*(?=]) Match [...] asserting [ to the left and ] to the right
See another regex demo
I think an alternative approach could be:
import re
pattern = re.compile("(\[[^\]]*\]\.\[[^\]]*\])")
print(pattern.findall(sss))
OUTPUT
['[Data Source].[Week]']

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = The special code is 034567 in this particular case and not 98675
In this example, I am interested in capturing the number 034657 which comes after the phrase special code and also the start and end index of the the number 034657.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
Your expression matches a space and any whitespace with \s pattern, then \w. matches any word char and any character other than a line break char, and then again \s requires two whitespaces, any whitespace and a space.
You may simply match any 1+ whitespaces using \s+ between words, and to match any chunk of non-whitespaces, instead of \w., you may use \S+.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
print(m.group(1)) # 034567
print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']

Regex : matching integers inside of brackets

I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the number and then match one or more digits between square brackets.
In the replacement using the first capturing group r'\1'
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more number inside. The sub then replaces the square brackets and number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively as said by the other answer you can just find it and splice to get rid of it.

Regex similar to a Hearst Pattern in Python

I'm trying to come up with a regex similiar to the ones listed here for Hearst Patterns in order to get the following results:
NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
Doing re.search(regex, sentence) for each of this sentences I want to match this 2 groupsNP_The_Eleventh_Air_Force NP_a_Numbered_Air_Force
This is my attempt but it doesn't get any matches:
(NP_\\w+ (, )?is (NP_\\w+ ?))
In both sentences I think (, )? is not present, but the part before between parenthesis is so you could make that part optional instead.
Also move the last parenthesis from )) to (NP_\w+) to create the first group.
The pattern including the optional comma and space could be:
(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
Regex demo
If you don't need the space at the end and the comma space is not present, you pattern could be:
(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
(NP_\w+) Capture group 1 Match NP_ and 1+ word chars
(?: \([^()]+\))? Optionally match a space and a part with parenthesis
is Match literally
(NP_\w+) Capture group 2 Match NP_ and 1+ word chars
See a regex demo | Python demo
For example
import re
regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)
if matches:
print(matches.group(1))
print(matches.group(2))
Output
NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
I got one, quite simple:
regex = r"NP.\w+ ?Forces?\b
You can see how it works out, it's a online tool to write and test regex for multiple languages:
https://regex101.com/r/KKH3D3/1/

Match first parenthesis with Python

From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
You can use
^(\d+)\s[^(]+\(([^()]+)\)
See the regex demo. Here, the [^(]+ restricts the matching to the characters other than ( (so, it cannot grab the whole line up to the end) and get to the first pair of parentheses.
Pattern expalantion:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\)- closing ).
Debuggex Demo
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated class \(([^)]*)\) to match anything between ( and ):
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print m.group(1)
70849
>>> print m.group(2)
linux;u;android4.2.1;zh-cn

Categories

Resources