Single regular expression for extracting different values - python

I have some inputs like
ID= 5657A
ID=PID=FSGDVD
IDS=5645SD
I have created a regex i.e IDS=[A-Za-z0-9]+|ID=[A-Za-z0-9]+|PID=[A-Za-z0-9]+. But, in the case of ID=PID=FSGDVD, I want PID=FSGDVD as output.
My outputs must look like
ID= 5657A
PID=FSGDVD
IDS=5645SD
How to go for this problem?

Add end of line anchor and use grouping and quantifiers to simplify the regex:
(?:IDS?|PID)=[A-Za-z0-9]+$
IDS? will match both ID and IDS
(?:IDS?|PID) will match ID or IDS or PID
(?:pattern) is a non-capturing group, some functions like re.split and re.findall will change their behavior based on capture groups, thus non-capturing group is ideal whenever backreferences aren't needed
$ is end of line anchor, thus you'll get the match towards end of line instead of start of line
Demo: https://regex101.com/r/e9uvmC/1
In case your input can be something like ID=PID=FSGDVD xyz then you could use lookarounds:
(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)
Here \b will ensure to match all word characters after = sign and (?!=) is a negative lookahead assertion to avoid a match if there is = afterwards
Demo: https://regex101.com/r/e9uvmC/2

Another one could be
[A-Z]+=\s*[^=]+$
See a demo on regex101.com.

Related

python regex, capturing a pattern with trimming repeated subpattern in string

Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
Leading "collect_" is fixed, and then follows "${project_version}" which doesn't have hard rule to set this variable, the naming will be very different by different users.
Then, there will be repeating "${part}${length}cm_.......", but the number of repeatence is not fixed.
I'd like to capture the the variable ${project_version}.
Then, I try using the following re.match to capture it.
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not as expected.
Is there anyone give me a hint that what's wrong in my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as far as it can, so it will keep going until the last match for (?:foot|arm|height)\d+cm_).
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)

How to match substring or whole string

In Python regex, how would I match only the facebook.com...777 substrings given either string? I don't want the ?sfnsn=mo at the end.
I have (?<=https://m\.)([^\s]+) to match everything after the https://m.. I also have (?=\?sfnsn) to match every thing in front of ?sfnsn.
How do I combine the regex to only return the facebook.com...777 part for either string.
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777?sfnsn=mo
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
Here's what I was messing around with https://regex101.com/r/WYz5dn/2
(?<=https://m\.)([^\s]+)(?=\?sfnsn)
You could use a capturing group instead of a positive lookbehind and match either ?sfnsn or the end of the string.
https://m\.(\S*?)(?:\?sfnsn|$)
Regex demo
Using the lookarounds, the pattern could be:
(?<=https://m\.)\S*?(?=\?sfnsn|$)
Regex demo
Putting a ? at the end works, since the last grouped lookahead may or may not exist, we put a question mark after it:
(?<=https://m\.)([^\s]+)(?=\?sfnsn)?

Regex, capture using word boundaries without stopping at "dot" and/or other characters

Given for example a string like this:
random word, random characters##?, some dots. username bob.1234 other stuff
I'm currently using this regex to capture the username (bob.1234):
\busername (.+?)(,| |$)
But my code needs a regex with only one capture group as python's re.findall returns something different when there are multiple capture groups. Something like this would almost work, except it will capture the username "bob" instead of "bob.1234":
\busername (.+?)\b
Anybody knows if there is a way to use the word boundary while ignoring the dot and without using more than one capture group?
NOTES:
Sometimes there is a comma after the username
Sometimes there is a space after the username
Sometimes the string ends with the username
The \busername (.+?)(,| |$) pattern contains 2 capturing groups, and re.findall will return a list of tuples once a match is found. See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, there are three approaches here:
Use a (?:...) non-capturing group rather than the capturing one: re.findall(r'\busername (.+?)(?:,| |$)', s). It will consume a , or space, but since only captured part will be returned and no overlapping matches are expected, it is OK.
Use a positive lookahead instead: re.findall(r'\busername (.+?)(?=,| |$)', s). The space and comma will not be consumed, that is the only difference from the first approach.
You may turn the (.+?)(,| |$) into a simple negated character class [^ ,]+ that matches one or more chars other than a space or comma. It will match till end of string if there are no , or space after username.

NOT operator for regex

Using python script, I am cleaning a piece of text where I want to replace following words:
promocode, promo, code, coupon, coupon code, code.
However, I dont want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
None of them are working. I am basically looking something that will allow me to say "Does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
See demo here
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b is good, but it also matches the words in the alternation group not preceded with #.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but you still do not match whole words (see this demo).

String pattern Regular Expression python

I am a novice in regular expressions. I have written the following regex to find abababab9 in the given string. The regular expression returns two results, however I was expecting one result.
testing= re.findall(r'((ab)*[0-9])',temp);
**Output**: [('abababab9', 'ab')]
According to my understanding, it should have returned only abababab9, why has it returned ab alone.
You didnt' read the findall documentation:
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has
more than one group.
Empty matches are included in the result.
And if you take a look at the re module capturing groups are subpatterns enclosed in parenthesis like (ab).
If you want to only get the complete match you can use one of the following solutions:
re.findall(r'(?:ab)*[0-9]', temp) # use non-capturing groups
[groups[0] for groups in re.findall(r'(ab)*[0-9]', temp)] # take the first group
[match.group() for match in re.finditer(r'(ab)*[0-9]', temp)] # use finditer
You have configured by (...) two matching groups the first group is ((ab)*[0-9]) and the second group is (ab). Therefore you get these two results. To get only the first group you could make the second a non-capturing group. This is done by ?:. So this result is not delivered.
((?:ab)*[0-9])
Debuggex Demo
This one only matches abababab9.
Edit 1:
Here is an explanation of the grouping concept of regular expressions: groups and capturing
Remove the second group capturing (ab) using ?: inside:
testing= re.findall(r'((?:ab)*[0-9])',temp);

Categories

Resources