Regular expressions and how to get captured values - python

custom = 'number=value1;user=value2;yr=value3'
number=re.findall('number=(.+?);',custom)
user=re.findall('user=(.+?);',custom)
yr=re.findall('yr=(.+?)[;\w]',custom)
outcome:
print number
value1
I am trying to extract the values of number, user, and yr from custom. It works for everything except yr: because yr is the last field, its value is not followed by ';'. I tried adding \w, but that does not work. Is there a way to say "ends with either ';' or the end of the string"? I could look at custom[-1], but I want to know how to do this in a regex, and yr is not always last; number or user can be last sometimes.

You can leverage the regex lookbehind and use a regex like this:
(?<==)(\w+)
So, you can use this regex for each case:
(?<=number=)(\w+)
(?<=user=)(\w+)
(?<=yr=)(\w+)
You can have your code as this:
import re
custom = 'number=value1;user=value2;yr=value3'
number = re.findall(r'(?<=number=)(\w+)', custom)
user = re.findall(r'(?<=user=)(\w+)', custom)
yr = re.findall(r'(?<=yr=)(\w+)', custom)
outcome:
print(number)
['value1']
Update: as CommuSoft pointed out in his comment, \w+ stops at the first non-word character, so the regex won't capture the whole value if it contains spaces. You can improve the regex by matching everything up to the next ';':
(?<==)([^;]+)
So, you can have for each parameter something like this:
(?<=number=)([^;]+)
(?<=user=)([^;]+)
(?<=yr=)([^;]+)
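As a minimal sketch of how the three lookbehind patterns above could be folded into one helper: the function name get_value is hypothetical (not from the original answer), and re.escape is used on the assumption that a key might contain regex metacharacters.

```python
import re

def get_value(key, text):
    """Return the text after 'key=' up to the next ';' or the end of the string."""
    # (?<=key=) is a fixed-width lookbehind; [^;]+ stops at ';' or at the end.
    match = re.search(r'(?<=' + re.escape(key) + r'=)[^;]+', text)
    return match.group(0) if match else None

custom = 'number=value1;user=value 2;yr=value3'
print(get_value('number', custom))  # value1
print(get_value('user', custom))    # value 2
print(get_value('yr', custom))      # value3
```

Because [^;]+ simply runs until the next ';' or the end of the string, the same pattern works no matter which field comes last.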

\w matches any word character but you want to match the end of the string.
You can use instead:
yr=(.+?)(?:;|$)
Also for learning/debugging regexes there are regex testers like this one:
https://regex101.com/

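\w matches a single word character. Because you made the quantifier non-greedy ("ungreedy"), the group ends as soon as possible: .+? captures only the first character ('v'), and [;\w] then matches the next character ('a'), so the result is 'v' rather than 'value3'. Instead, end the pattern with an alternation between ';' and the end of the string: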
(;|$)
So this results in:
yr=re.findall('yr=(.+?)(?:;|$)',custom)
which gives the correct result
The ?: makes the group non-capturing: the ';' (or the end of the string) is needed to delimit the value, but it should not appear in the output.
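To make the difference concrete, here is a small comparison of the pattern from the question with the corrected one. The variable names broken and fixed are only for illustration.

```python
import re

custom = 'number=value1;user=value2;yr=value3'

# Pattern from the question: the lazy group stops after one character,
# because [;\w] is satisfied by the very next word character.
broken = re.findall(r'yr=(.+?)[;\w]', custom)

# Corrected pattern: the value must be followed by ';' or the end of the string.
fixed = re.findall(r'yr=(.+?)(?:;|$)', custom)

print(broken)  # ['v']
print(fixed)   # ['value3']
```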

Try this:
number, user, yr = re.findall(r'(?<==)[^;]+', custom)
print(number, user, yr)
Result: value1 value2 value3
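Note that unpacking into three names relies on the fields always appearing in the same order, which the question says is not guaranteed. A possible variation (a sketch, not part of the original answer) captures the key together with the value and builds a dictionary, assuming keys consist of word characters only:

```python
import re

custom = 'yr=value3;number=value1;user=value2'

# Each match yields a (key, value) tuple; dict() turns the list into a lookup table.
fields = dict(re.findall(r'(\w+)=([^;]+)', custom))

print(fields['number'], fields['user'], fields['yr'])  # value1 value2 value3
```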

Related

Single regular expression for extracting different values

I have some inputs like
ID= 5657A
ID=PID=FSGDVD
IDS=5645SD
I have created a regex, namely IDS=[A-Za-z0-9]+|ID=[A-Za-z0-9]+|PID=[A-Za-z0-9]+. But in the case of ID=PID=FSGDVD, I want PID=FSGDVD as output.
My outputs must look like
ID= 5657A
PID=FSGDVD
IDS=5645SD
How can I solve this problem?
Add end of line anchor and use grouping and quantifiers to simplify the regex:
(?:IDS?|PID)=[A-Za-z0-9]+$
IDS? will match both ID and IDS
(?:IDS?|PID) will match ID or IDS or PID
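(?:pattern) is a non-capturing group. Some functions, like re.split and re.findall, change their behavior based on capture groups, so a non-capturing group is ideal whenever backreferences aren't needed
$ is the end-of-line anchor, so the match is taken towards the end of the line instead of the start. In Python, $ matches at the end of every line only when re.MULTILINE is set; without the flag it matches only at the end of the string (or just before a trailing newline)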
Demo: https://regex101.com/r/e9uvmC/1
In case your input can be something like ID=PID=FSGDVD xyz then you could use lookarounds:
(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)
Here \b forces the match to run to the end of the word characters after the = sign (so the engine cannot backtrack to a shorter match such as ID=PI), and (?!=) is a negative lookahead assertion that rejects the match if an = follows
Demo: https://regex101.com/r/e9uvmC/2
Another one could be
[A-Z]+=\s*[^=]+$
See a demo on regex101.com.
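The sketch below runs the anchored patterns from both answers against the sample inputs, one line at a time. The names strict, lenient, and last_pair are hypothetical. One thing it shows is that the first pattern does not allow whitespace after the = sign, so it does not match the "ID= 5657A" input as written in the question, while the second pattern does.

```python
import re

lines = ["ID= 5657A", "ID=PID=FSGDVD", "IDS=5645SD"]

strict = re.compile(r"(?:IDS?|PID)=[A-Za-z0-9]+$")  # no whitespace allowed after '='
lenient = re.compile(r"[A-Z]+=\s*[^=]+$")           # optional whitespace after '='

def last_pair(pattern, line):
    """Return the key=value pair that ends the line, or None if there is none."""
    m = pattern.search(line)
    return m.group(0) if m else None

strict_results = [last_pair(strict, line) for line in lines]
lenient_results = [last_pair(lenient, line) for line in lines]

print(strict_results)   # [None, 'PID=FSGDVD', 'IDS=5645SD']
print(lenient_results)  # ['ID= 5657A', 'PID=FSGDVD', 'IDS=5645SD']

# The lookaround variant copes with trailing text after the pair.
trailing = re.search(r"(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)", "ID=PID=FSGDVD xyz")
print(trailing.group(0))  # PID=FSGDVD
```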

How to match substring or whole string

In Python regex, how would I match only the facebook.com...777 substrings given either string? I don't want the ?sfnsn=mo at the end.
I have (?<=https://m\.)([^\s]+) to match everything after the https://m.. I also have (?=\?sfnsn) to match everything in front of ?sfnsn.
How do I combine the regexes to return only the facebook.com...777 part for either string?
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777?sfnsn=mo
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
Here's what I was messing around with https://regex101.com/r/WYz5dn/2
(?<=https://m\.)([^\s]+)(?=\?sfnsn)
You could use a capturing group instead of a positive lookbehind and match either ?sfnsn or the end of the string.
https://m\.(\S*?)(?:\?sfnsn|$)
Using the lookarounds, the pattern could be:
(?<=https://m\.)\S*?(?=\?sfnsn|$)
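Simply putting a ? after the last lookahead to make it optional does not work:
(?<=https://m\.)([^\s]+)(?=\?sfnsn)?
The greedy [^\s]+ already runs to the end of the URL, and an optional lookahead can always be skipped, so ?sfnsn=mo stays in the match. The match has to be lazy and the lookahead has to accept the end of the string, as in the (?=\?sfnsn|$) pattern above.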
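For reference, a short sketch applying the capturing-group pattern to both example URLs from the question. The names pattern, urls, and results are illustrative.

```python
import re

# Lazy \S*? grows one character at a time until '?sfnsn' or the end of the string follows.
pattern = re.compile(r"https://m\.(\S*?)(?:\?sfnsn|$)")

urls = [
    "https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777?sfnsn=mo",
    "https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777",
]

results = [pattern.search(url).group(1) for url in urls]
for result in results:
    print(result)
# Both lines print:
# facebook.com/story.php?story_fbid=123456789&id=7777777777
```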

Python regex match pattern "X<string1>:X<string2>"

I'm parsing a file which has text "$string1:$string2"
How do I regex match this string and extract "string1" and "string2" from it? Basically, I want to match this pattern: "$*:$*"
You were nearly there with your own pattern, it needs three alterations in order to work as you want it.
First, the star in regexes isn't a glob, as you might expect from shell scripting; it's a Kleene star. That means it needs a preceding character or character class to apply its "zero to n times" logic to. In your case, the word character class \w (letters, digits, and underscore) should work. If that's too restrictive, use . instead, which matches any character except line breaks.
Secondly, you need to apply the regex in a way that you can easily extract the results you want. The usual way to go about it is to define groups, using parentheses.
Last but not least, the $ sign is a meta-character in regexes, so if you want to match it literally, you need to write a backslash in front of it.
In working code, it'll look like this:
import re
s = "$string1:$string2"
r = re.compile(r"\$(\w*):\$(\w*)")
match = r.match(s)
print(match.group(1)) # print the first group that was matched
print(match.group(2)) # print the second group that was matched
Output:
string1
string2
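When parsing a file, some lines may not follow the pattern, in which case match returns None and calling .group on it raises an error. A hedged variation (the helper name parse_pair is hypothetical) checks for that and uses fullmatch so the whole line has to fit the pattern:

```python
import re

PAIR = re.compile(r"\$(\w*):\$(\w*)")

def parse_pair(line):
    """Return (string1, string2) if the whole line looks like '$a:$b', else None."""
    m = PAIR.fullmatch(line.strip())  # strip() drops the trailing newline from file lines
    return m.groups() if m else None

print(parse_pair("$string1:$string2\n"))  # ('string1', 'string2')
print(parse_pair("no dollar signs here"))  # None
```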

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.
re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'
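One caveat with the pattern above is that it needs a newline after every match, so a final '>' line without a trailing newline is missed. A possible alternative, sketched here under the assumption that the '>' always starts a line, anchors the pattern with ^ and $ and passes re.MULTILINE:

```python
import re

text = ">test1\n(something else here)\n>test2\n(something else here)\n>test3"

# Original pattern: requires a literal newline after the captured text.
with_newline = re.findall(r">(.*)\n", text)

# Anchored pattern: with re.MULTILINE, ^ and $ match at the start and end of every line.
anchored = re.findall(r"^>(.*)$", text, flags=re.MULTILINE)

print(with_newline)  # ['test1', 'test2']
print(anchored)      # ['test1', 'test2', 'test3']
print("\n".join(anchored))
```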
You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing group with an empty alternative (not a true conditional): if the matched text is followed by a newline, the group matches that newline, provided that the newline is not followed by another newline or carriage return; otherwise it matches nothing
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

Regexp - find a value between a part of the string and a second part of the string OR end of line

I've looked through many regexp examples here, but still fail to find a solution.
I have to check a request string for a certain substring. The substring in question will always have something before it and might have something after it:
?something=xxx&to_dep=YYY&from_dep=zzz&...
OR
?something=xxx&to_dep=YYY
I need to extract YYY without a & in first case and simply YYY in the second case.
For now I use this kind of regexp:
re.search('to_dep=(.+?)&', req.query_string)
but it works only in the first case, and it can't be used if I want to re.sub it (replacing YYY with something else replaces the & too).
Any help?
Just try with:
[?&]to_dep=([^&]*)
[^&]* matches any run of characters that are not &, so it stops either at the next & (first case) or at the end of the string (second case).
To handle both cases, you might use a positive lookbehind and a negated class:
re.search(r'(?<=to_dep=)[^&]+', req.query_string)
And this will give you only YYY, which then means you can also use it in re.sub:
re.sub(r'(?<=to_dep=)[^&]+', 'new_value', req.query_string)
[^&] matches any character except &.
(?<=to_dep=) makes sure there's a to_dep= before the part to match.
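Here is a small sketch that exercises the lookbehind pattern on both query strings from the question. The plain strings q1 and q2 stand in for req.query_string, and the [?&] inside the lookbehind is an optional extra (borrowed from the first answer) that keeps a parameter such as not_to_dep from matching by accident.

```python
import re

q1 = "?something=xxx&to_dep=YYY&from_dep=zzz"
q2 = "?something=xxx&to_dep=YYY"

value_re = re.compile(r"(?<=to_dep=)[^&]+")

found = [value_re.search(q).group(0) for q in (q1, q2)]
print(found)  # ['YYY', 'YYY']

# Only the value is matched, so re.sub leaves the surrounding '&' untouched.
replaced = value_re.sub("new_value", q1)
print(replaced)  # ?something=xxx&to_dep=new_value&from_dep=zzz

# Stricter variant: the parameter name must directly follow '?' or '&'.
strict_re = re.compile(r"(?<=[?&]to_dep=)[^&]+")
tricky = "?not_to_dep=a&to_dep=YYY"
print(value_re.search(tricky).group(0))   # a
print(strict_re.search(tricky).group(0))  # YYY
```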
