Not sure if this is something that should be a bounty. I just want to understand regex better.
I checked the responses in the "Regex to match pattern.one skip newlines and characters until pattern.two" and "Regex to match if given text is not found and match as little as possible" threads and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
Problem 1:
Given the strings below, with characters followed by /, how would I produce a SINGLE regex (using only either re.sub or re.findall) that uses alternating capture groups, which must use [\S]+/, to get the desired output?
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?:            # Optional group
  (?! / )      # Not a forward slash ahead
  [\S]         # Not whitespace class
)*             # End group, do 0 to many times
-              # A dash must exist
(?:            # Optional group, same as above
  (?! / )
  [\S]
)*
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.
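To compare the two suggestions above, here is a small sketch that runs both patterns over the three sample strings from the question. The names tempered, simple and with_hyphen are only illustrative, and the third pattern is my own variation rather than something from either answer: it moves the hyphen inside the capture group so the trailing - from the desired output is kept.

```python
import re

strings = [
    'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
    'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/',
    'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives',
]

# Tempered pattern: [\S] may only consume a character
# when that character is not a forward slash.
tempered = re.compile(r'(?:(?!/)[\S])*-(?:(?!/)[\S])*')

# Simpler pattern: capture the letters and hyphens before the final "-<digits>".
simple = re.compile(r'/([-a-z]+)-\d+')

# Variation (an assumption, not from the answers): keep the trailing hyphen.
with_hyphen = re.compile(r'/([-a-z]+-)\d+')

tempered_results = [tempered.search(s).group(0) for s in strings]
simple_results = [simple.search(s).group(1) for s in strings]
hyphen_results = [with_hyphen.search(s).group(1) for s in strings]

print(tempered_results[0])  # tax-march-donald-trump-protest-1202031487
print(simple_results[0])    # tax-march-donald-trump-protest
print(hyphen_results[0])    # tax-march-donald-trump-protest-
```

All three strings give the same result per pattern, because the only path segment containing a hyphen is the same in each.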
Related
This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex can actually match zero digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
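As a quick check of the optional non-capturing group, the sketch below runs the final pattern against the example string with and without the #p part. The variable names are illustrative only.

```python
import re

# Group 2 lives inside an optional non-capturing group,
# so the "#p<digits>" suffix may be absent.
pattern = re.compile(r'(\d+)(?:#p(\d+))?')

with_suffix = pattern.match('123456#p654321')
without_suffix = pattern.match('123456')

print(with_suffix.groups())     # ('123456', '654321')
print(without_suffix.groups())  # ('123456', None)
```

In Python, a group that did not take part in the match is reported as None, so the caller can test for the missing second value directly.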
Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
The leading "collect_" is fixed. It is followed by "${project_version}", which has no hard naming rule, so it varies widely between users.
Then comes a repeating "${part}${length}cm_.......", but the number of repetitions is not fixed.
I'd like to capture the variable ${project_version}.
I tried the following re.match to capture it:
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not as expected.
Can anyone give me a hint about what's wrong in my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as far as it can, so it will keep going until the last match for (?:foot|arm|height)\d+cm_.
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)
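To make the greedy versus non-greedy difference concrete, here is a rough sketch that applies both forms of the capture group to the three input strings from the question (the "......" is kept exactly as posted). The names parts, greedy and lazy are mine, and the trailing .* from the original pattern is dropped because re.match does not need it.

```python
import re

inputs = [
    'collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......',
    'collect_project_version_1_0927_foot60cm_height170cm_......',
    'collect_project_ver1_20220927_arm70cm_height170cm_......',
]

# One or more "<part><length>cm_" components.
parts = r'(?:(?:foot|arm|height)\d+cm_)+'

greedy = re.compile(r'collect_(.*)_' + parts)   # takes as much as possible
lazy = re.compile(r'collect_(.*?)_' + parts)    # takes as little as possible

greedy_results = [greedy.match(s).group(1) for s in inputs]
lazy_results = [lazy.match(s).group(1) for s in inputs]

for g, l in zip(greedy_results, lazy_results):
    print(f'greedy: {g}')
    print(f'lazy:   {l}')
```

The greedy group swallows every cm component except the last one, while the lazy group stops at the first underscore that is followed by a valid component.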
I need assistance with matching spaces and subsequent matches in regex.
the example is as follows:
I want to match all of the following scenarios:
60 ml ( 1)
60ML (2 )
60ml(2) (a)
the regex I have used is:
(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})
the regex matches the first 2 examples, but not the instances where there is a space between (2) and (a).
any guidance would be appreciated.
Your regex doesn't allow for spaces between the parenthesised groups (2) and (a) in your last example. You can add <space>* to it to allow it to do so. Note you cannot use \s* unless you are only matching a single value at a time, otherwise the fact that \s will match newline can cause the first match to go too far.
(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})
Note that without anchors counting repetitions doesn't really make sense. For example, this regex will match both 60ML (2 )(a)(a)(a)(a) and 60ML (2 )(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a), returning 60ML (2 )(a)(a)(a)(a) in both cases. If that is not what you want, you will need to add an anchor to the end of the regex ($ perhaps) to prevent it matching the longer string.
Demo on regex101
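The following sketch compares the original and the adjusted pattern on the three examples, and then shows the point about missing anchors. It assumes case-insensitive matching (re.IGNORECASE), since the samples mix "ml" and "ML"; the variable names are illustrative.

```python
import re

samples = ['60 ml ( 1)', '60ML (2 )', '60ml(2) (a)']

# Original pattern from the question: no space allowed between groups.
original = re.compile(r'(60\s?(?:ml)\s?(?:\w|\(.{0,3}\)){0,5})', re.IGNORECASE)
# Adjusted pattern: " *" lets plain spaces follow each parenthesised group.
fixed = re.compile(r'(60\s?ml\s?(?:\w|\(.{0,3}\) *){0,5})', re.IGNORECASE)

original_results = [original.search(s).group(1) for s in samples]
fixed_results = [fixed.search(s).group(1) for s in samples]

print(original_results)  # ['60 ml ( 1)', '60ML (2 )', '60ml(2)']
print(fixed_results)     # ['60 ml ( 1)', '60ML (2 )', '60ml(2) (a)']

# Without an end anchor, {0,5} only caps how much is consumed:
# a longer input still matches, just truncated after five groups.
long_input = '60ML (2 )' + '(a)' * 12
truncated = fixed.search(long_input).group(1)
print(truncated)         # 60ML (2 )(a)(a)(a)(a)
```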
I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx 101 Explanation
/(?!.*-)(?=.*\d)^.+$/gm
Negative Lookahead (?!.*-)
  Assert that the Regex below does not match
  . matches any character (except for line terminators)
  * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
  Assert that the Regex below matches
  . matches any character (except for line terminators)
  * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  \d matches a digit (equal to [0-9])
^ asserts position at start of a line
. matches any character (except for line terminators)
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
I came up with:
^[^-]*\d[^-]*$
So we need at LEAST one digit (\d).
We need the rest of the string to contain anything BUT a - ([^-]).
We can have an unlimited number of those characters, so [^-]*.
But putting them together like [^-]*\d would go wrong on aaa3-, because the - comes after an otherwise valid match. Let's make sure no dashes can sneak in before or after our match: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says: start, any number of chars that aren't dashes, a digit, any number of chars that aren't dashes, end.
Depending on style, we could also do ^([^-]*\d)+[^-]*$; since the order of the digits and other characters doesn't matter, we can have as many of those groups as we want.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search(r"\d", text):
    ...  # text has no hyphen and contains at least one digit
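As a sanity check, the sketch below runs the lookahead pattern, the character-class pattern and the plain Python test against the sample words from the question. The helper names plain_check and accepted are hypothetical, introduced only for this comparison.

```python
import re

should_match = ['1DRIVE', 'ID100', 'W1RELESS']
should_not_match = ['STACK', 'OVERFLOW', 'Test-11', '24-hours']

lookahead = re.compile(r'(?!.*-)(?=.*\d)^.+$')
char_class = re.compile(r'^[^-]*\d[^-]*$')

def plain_check(text):
    """Substring test for the hyphen plus a digit search, no combined regex."""
    return '-' not in text and re.search(r'\d', text) is not None

def accepted(word):
    # Verdicts of the three approaches for one word, in a fixed order.
    return (
        lookahead.search(word) is not None,
        char_class.search(word) is not None,
        plain_check(word),
    )

for word in should_match + should_not_match:
    print(word, accepted(word))
```

On these samples all three approaches agree: every word in the first list is accepted and every word in the second is rejected.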
I have a few data lines:
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Pro3Pro/c.9T>C SNPEFF_CODON_CHANGE=ccT/ccC
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Trp7Ser/c.20G>C SNPEFF_CODON_CHANGE=tGg/tCg
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Lys17Arg/c.50A>G SNPEFF_CODON_CHANGE=aAa/aGa
and so on..
I want to be able to extract just the values for the keys SNPEFF_AMINO_ACID_CHANGE, that is p.Pro3Pro/c.9T>C, p.Trp7Ser/c.20G>C and p.Lys17Arg/c.50A>G. Any ideas on how to create a pattern for this?
Usually, questions like this are expected to show some effort. Next time, please state the exact problem and include what you have already attempted.
To get you started, you could try the following regular expression:
>>> re.findall(r'SNPEFF_AMINO_ACID_CHANGE=(\S+)', text)
This will extract the values from the pattern and store them in a list.
Explanation:
SNPEFF_AMINO_ACID_CHANGE=   # match 'SNPEFF_AMINO_ACID_CHANGE='
(                           # group and capture to \1:
  \S+                       #   non-whitespace (1 or more times)
)                           # end of \1
Working Demo
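For completeness, here is a self-contained version of the call above, run against the three data lines from the question. The variable names text and values are just for illustration.

```python
import re

text = """\
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Pro3Pro/c.9T>C SNPEFF_CODON_CHANGE=ccT/ccC
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Trp7Ser/c.20G>C SNPEFF_CODON_CHANGE=tGg/tCg
ReadPosRankSum=### SNPEFF_AMINO_ACID_CHANGE=p.Lys17Arg/c.50A>G SNPEFF_CODON_CHANGE=aAa/aGa
"""

# \S+ stops at the first whitespace, so each value ends
# right before the following SNPEFF_CODON_CHANGE key.
values = re.findall(r'SNPEFF_AMINO_ACID_CHANGE=(\S+)', text)
print(values)  # ['p.Pro3Pro/c.9T>C', 'p.Trp7Ser/c.20G>C', 'p.Lys17Arg/c.50A>G']
```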