Extract portions of text if Regex in Python

I have a previously matched string such as:
<a href="somelink here something">
Now I want to extract only the value of one or more specific attributes in the tag. The value may be anything and the attribute may occur anywhere in the tag.
regex_pattern=re.compile('href=\"(.*?)\"')
I can use the pattern above to match the attribute together with its value, but I need to extract only the (.*?) part (the value).
I can of course strip href=" and " afterwards, but I'm sure regex can extract only the required part directly.
In simple words I want to match
abcdef=\"______________________\"
in the pattern but want only the
____________________
Part
How do I do this?

Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext and it will yield the matched group.

Take a look at the .group() method on regular expression MatchObject results.
Your regular expression has an explicit capturing group (the part in parentheses), and the .group() method gives you direct access to the string that was matched within that group. MatchObject instances are produced by several re functions and methods, including .search() and .finditer().
Demonstration:
>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"')
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'
From the Regular Expression syntax documentation on the (...) parenthesis syntax:
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
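As a small sketch of the same idea, the snippet below guards against the case where the pattern does not match (re.search then returns None) and uses finditer to collect every attribute value. The helper name first_attr_value and the sample tag are made up for illustration; this is not meant as a general HTML parser.

```python
import re

# Hypothetical helper: return the value of the first attr="..." in text, or None.
def first_attr_value(text, attr="href"):
    match = re.search(re.escape(attr) + r'="(.*?)"', text)
    # re.search returns None when nothing matches, so check before calling .group(1).
    return match.group(1) if match else None

tag = '<a href="somelink here something" title="demo">'

href_value = first_attr_value(tag)
missing_value = first_attr_value(tag, "class")

# finditer yields one match object per attribute:
# group(1) is the attribute name, group(2) is its value.
all_values = {m.group(1): m.group(2) for m in re.finditer(r'(\w+)="(.*?)"', tag)}

print(href_value)
print(missing_value)
print(all_values)
```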

Related

Replace second and last second characters, using re.sub

I have a string "F(foo)", and I'd like to replace it with "F('foo')". I know we can also use a regular expression and do this replacement using re.sub(r"F\(foo\)", r"F\('foo'\)",str). The problem is that foo is a dynamic string: it is different every time we want to do this replacement. Is it possible to do such a replacement in a cleaner way with some sort of regex?
I remember one way: extract foo using () and then .group(1). But this would require me to define one more temporary variable just to store foo. I'm curious whether there is a way to replace "F(foo)" with "F('foo')" in a single line, or in other words in a cleaner way.
Examples :
F(name) should be replaced with F('name').
F(id) should be replaced with F('id').
G(name) should not be replaced.
So, the regex would be r"F\((\w)+\)" to find such strings.

Using re.sub
Ex:
import re
s = "F(foo)"
print(re.sub(r"\((.*)\)", r"('\1')", s))
Output:
F('foo')

The following regex encloses valid [Python|C|Java] identifiers after F and in parentheses in single quotation marks:
re.sub(r"F\(([_a-z][_a-z0-9]*)\)", r"F('\1')", s, flags=re.I)
#"F('foo')"

There are several ways, depending on what foo actually is.
If it can't contain ( or ), you can just replace ( with (' and ) with '). Otherwise, try using
re.sub(r"F\((.*)\)", r"F('\1')", yourstring)
where the \1 in the replacement part will reference the (.*) capture group in the search regex

In your pattern F\((\w)+\) you are almost there; you just need to put the quantifier + directly after the \w, inside the group, to repeat matching 1+ word characters.
If you put it after the capturing group, you repeat the group itself, so the group keeps only the value of its last iteration, which would be the second o in foo.
You could update your expression to:
F\((\w+)\)
And in the replacement refer to the capturing group using \1
F('\1')
For example:
import re
text = "F(foo)"  # avoid naming the variable str, which would shadow the built-in
print(re.sub(r"F\((\w+)\)", r"F('\1')", text)) # F('foo')
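The difference between repeating the group and repeating \w inside the group can be checked directly. This is only an illustrative sketch; the sample strings are the ones from the question.

```python
import re

# (\w)+ repeats the group, so the group keeps only the last character it matched.
last_only = re.search(r"F\((\w)+\)", "F(foo)").group(1)

# (\w+) repeats \w inside the group, so the group holds the whole word.
whole_word = re.search(r"F\((\w+)\)", "F(foo)").group(1)

# Only calls to F are rewritten; G(name) is left alone.
rewritten = re.sub(r"F\((\w+)\)", r"F('\1')", "F(name) F(id) G(name)")

print(last_only, whole_word, rewritten)
```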

python regular expression replace only in parentheses

I would like to replace ー with － in text that matches a regular expression like \d+(ー)\d+(ー)\d+. I tried re.sub, but it replaces the whole match, including the numbers. Is it possible to replace only the parts in parentheses?
e.g.
I want sub('\d+(ー)\d+(ー)\d+','4ー3ー1','-') to return '4-3-1'. Assume that a simple replace cannot be used because there are other ー characters that do not satisfy the regular expression. My current solution is to split the text and do the replacement on the parts that satisfy the regular expression.

You may use the Group Reference here.
import re
before = '4ー3ー1ーー4ー31'
after = re.sub(r'(\d+)ー(\d+)ー(\d+)', r'\1-\2-\3', before)
print(after) # '4-3-1ーー4ー31'
Here, r'\1' is a reference to the first group, i.e. the first pair of parentheses.

You could use a function for the repl argument in re.sub so that only the ー characters inside each match are touched.
import re
s = '1234ー2134ー5124'
re.sub(r"\d+(ー)\d+(ー)\d+", lambda x: x.group(0).replace('ー', '-'), s)
Using a slightly different pattern, you might be able to take advantage of a lookahead expression, which does not consume the part of the string it matches. That is to say, a lookahead/lookbehind only asserts that the text at that position matches the lookaround expression; that text does not become part of the match.
re.sub(r"ー(?=\d+)", "-", s)
If you can live with a fixed-length expression for the part preceding the ー, you can combine the lookahead with a lookbehind to make the regex a little more conservative.
re.sub(r"(?<=\d)ー(?=\d+)", "-", s)
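To see how the two approaches differ, the sketch below runs the function-based replacement and the lookaround-based one on a string that also contains ー characters outside the digit-ー-digit-ー-digit shape. The sample string is borrowed from the group-reference answer and is only an illustration.

```python
import re

s = '4ー3ー1ーー4ー31'

# Function as repl: only ー inside a full digit-ー-digit-ー-digit match is replaced.
by_function = re.sub(r"\d+ー\d+ー\d+", lambda m: m.group(0).replace('ー', '-'), s)

# Lookarounds: every ー that sits directly between two digits is replaced,
# even when it is not part of the longer three-number pattern.
by_lookaround = re.sub(r"(?<=\d)ー(?=\d)", "-", s)

print(by_function)
print(by_lookaround)
```

The lookaround version is shorter but looser, so which one fits depends on whether a lone digit-ー-digit pair should be rewritten as well.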

re.sub('\d+(ー)\d+(ー)\d+','4ー3ー1','-')
As you pointed out, the output of this call will be '-'. The arguments are in the wrong order (the signature is re.sub(pattern, repl, string)), and even with the order fixed, re.sub replaces the entire match, digits included, with '-'. To replace ー with - you can use
import re
input_string = '4ー3ー1'
re.sub('ー','-', input_string)
or you could find all the runs of digits and join them with a '-':
'-'.join(re.findall(r'\d+', input_string))
Both methods give '4-3-1' for this input.

Named backreference (?P=name) issue in Python re

I am learning the re module of Python, and the named backreference (?P=name) confuses me.
When I use re.sub() to swap the digits and the characters, the pattern '(?P=name)' doesn't work, but '\N' and '\g<name>' do. Code below:
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd'))
[OUT] (?P=char)-(?P=digit)
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))
[OUT] abcd-123
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))
[OUT] abcd-123
Why does the substitution fail when I use (?P=name)?
And how do I use it correctly?
I am using Python 3.5.

The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:
(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.
See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.
To replace matches, use \1, \g<1> or \g<char> syntax with re.sub replacement pattern. Do not use (?P=name) for that purpose:
repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern... In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
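The distinction can be sketched in a few lines: (?P=name) belongs in the pattern, \g<name> in the replacement. The second test string (with wxyz) is made up here to show a failing backreference.

```python
import re

pattern = r'(?P<digit>\d{3})-(?P<char>\w{4})'

# In the replacement string, refer to named groups with \g<name>.
swapped = re.sub(pattern, r'\g<char>-\g<digit>', '123-abcd')

# Inside the pattern, (?P=name) must match the same text the group captured.
repeated = re.search(pattern + r'&(?P=char)-(?P=digit)', '123-abcd&abcd-123')
not_repeated = re.search(pattern + r'&(?P=char)-(?P=digit)', '123-abcd&wxyz-123')

# \g<1>0 means "group 1, then a literal 0"; \10 would be read as group 10.
with_zero = re.sub(r'(\d)', r'\g<1>0', '7')

print(swapped, bool(repeated), bool(not_repeated), with_zero)
```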

You can check the details of using and back-referencing ?P by visiting:
https://docs.python.org/3/library/re.html
and using CTRL+F in your browser to look for (?P...). The documentation includes a table showing the contexts in which each form of reference to a named group can be used.
In this example, your third re.sub() call does it right.
In any re.sub() call you can use the ?P=name syntax only in the first string parameter (the pattern); you don't need it in the second string parameter because you have the \g syntax there.
In case you're wondering whether ?P=name is useful: it is, but for matching the same text that an earlier named group already captured.
Example: you want to match potatoXXXpotato and replace it with YYXXXYY. You could write:
re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato')
or
re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)', r'YY\g<triple>YY', 'potatoXXXpotato')

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks

Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
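Two related options may also help with the "all regexes in general" part of the question. This is a sketch rather than a complete answer: \g<0> reuses the whole match without any capture group, and a function passed as repl can compute the replacement. The helper name bump is made up for illustration.

```python
import re

# \g<0> stands for the whole match, so no capture group is needed to reuse it.
wrapped = re.sub(r'\d+', r'<\g<0>>', 'zebra432')

# A function as repl receives the match object and returns the replacement text,
# which allows arbitrary computation on the captured parts.
def bump(match):
    return 'lion' + str(int(match.group(1)) + 1)

bumped = re.sub(r'[a-z]*(\d+)', bump, 'zebra432')

print(wrapped)
print(bumped)
```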

Python regex - (\w+) results different output when used with complex expression

I have a doubt about a Python regex operation. Here is my sample test.
>>> re.match(r'(\w+)','a-b')
<_sre.SRE_Match object at 0x7f51c0033210>
>>> re.match(r'(\w+):(\d+)','a-b:1')
>>>
Why doesn't the second regex return a match object, when the first regex returns one for a plain string match even though the string contains a special character?
As far as I know, \w+ matches [a-z,A-Z,_]. I'm not clear why (\w+) returns a match object for the string 'a-b'. How can I check whether the given string doesn't contain any special characters?

Taking a look at the actual match will give you an idea of what happens.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)
As you can see, the expression matched a. The character class \w only contains actual word characters, but not separators like dashes. So you can’t actually match a-b using just \w+.
Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) does match the 1. However it does not happen due to how re.match works. As the documentation hints, it only tries to match “at the beginning of string”. So when using re.match there is an implicit ^ at the beginning of the expression that makes it only match from the start. So it actually tries to find a match starting with a.
Instead, you can use re.search which actually looks in the whole string if it can match the expression anywhere. So there, you will get a result:
>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')
For further information on the search vs. match topic, check this section in the manual.
And finally, if you want to match dashes too, you can use a character class such as [\w-]:
>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')
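The same comparison as a short script, with one possible answer to the "no special characters" part of the question. It assumes that "no special characters" means "only \w characters", and the helper name only_word_chars is made up for illustration.

```python
import re

text = 'a-b:1'
pattern = r'(\w+):(\d+)'

# match only tries at the start of the string; search scans for the first match anywhere.
at_start = re.match(pattern, text)
anywhere = re.search(pattern, text)

# fullmatch (Python 3.4+) succeeds only if the whole string fits the pattern,
# which gives a simple "word characters only" check.
def only_word_chars(s):
    return re.fullmatch(r'\w+', s) is not None

print(at_start, anywhere.groups(), only_word_chars('a-b'), only_word_chars('ab_1'))
```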

The first matches the a - one or more word chars.
The second is one or more word chars immediately followed by a : which there aren't...
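[a-z,A-Z,_] is not quite the equivalent of \w: for ASCII text, \w is [a-zA-Z0-9_]. Inside a character class, a-z means the range a to z and A-Z means A to Z; the hyphen is not a literal hyphen in this context. If you do want a literal hyphen, put it as the first or last character of the character class.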

Match's docs say
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
The match method returns a match object if it finds a match at the beginning of the string. (\w+) matches a in a-b.
print(re.match(r'(\w+)','a-b').group())
will print
a
In the second case ((\w+):(\d+)), the only part of the string that could match is b:1, which is not at the beginning of the string. That's why it returns None.
How can I check whether the given string doesn't contain any special characters?
I would say the second regular expression you used, together with the match function, should be enough. I insist on match, since there are differences between match and search: http://docs.python.org/2.7/library/re.html#search-vs-match
Remember, you
