Python re.search regex using or condition - python

Hi friends, I'm trying to include an "or" (|) condition in a search pattern with re.search. Can someone show me how to express the or condition? My attempt does not match.
Below code works
>>> pattern = re.escape('apple.fruit[0]')
>>> sig = 'apple.fruit[0]'
>>> if re.search(pattern, sig):
...     print("matched")
...
matched
>>> pattern = re.escape('apple.fruit[0] or vegi[0]')
>>> if re.search(pattern, sig):
...     print("matched")
...
>>>
I want to match the string "apple." followed by either fruit[0] or vegi[0].

A regex "or" is written with the | operator, and the | must not be passed through re.escape. If you escape it, it loses its special meaning and is matched as a literal character.
pattern = re.escape('apple.fruit[0]')+ '|' + re.escape('vegi[0]')
or
pattern = r'apple\.fruit\[0\]|vegi\[0\]'
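Note that both patterns above match either the full 'apple.fruit[0]' or a bare 'vegi[0]'. If the intent is "apple." followed by either name, one option is to put the alternatives in a non-capturing group after the shared prefix. The following is a minimal sketch under that assumption; the variable names and the 'apple.meat[0]' test string are made up for illustration.

```python
import re

# Escape each literal piece separately, then join the alternatives with an
# unescaped "|" inside a non-capturing group so the prefix applies to both.
prefix = re.escape('apple.')
alternatives = '|'.join(re.escape(name) for name in ('fruit[0]', 'vegi[0]'))
pattern = prefix + '(?:' + alternatives + ')'

# 'apple.meat[0]' is a hypothetical negative example.
for sig in ('apple.fruit[0]', 'apple.vegi[0]', 'apple.meat[0]'):
    print(sig, bool(re.search(pattern, sig)))
```

Building the pattern from escaped pieces keeps the dots and brackets literal while leaving | and the group syntax active.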

Related

Python re regex sub letters not surrounded in quotes and not if they match specific word including regex group / match

I need to sub letters not surrounded in quotes and not if they match the word TODAY with a particular string where a part of it includes the match group e.g.
import re
import string
s = 'AB+B+" HELLO"+TODAY()/C* 100'
x = re.sub(r'\"[^"]*\"|\bTODAY\b|([A-Z]+)', r'a2num("\g<0>")', s)
print (x)
expected output:
'a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100'
actual output:
'a2num("AB")+a2num("B")+a2num("" HELLO"")+a2num("TODAY")()/a2num("C")* 100'
I am nearly there, but it is not obeying the quote rule or the TODAY rule. I know the string doesn't make any sense; it's just a harsh test of the regex.
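Your regex approach is correct, but you need to pass a function (here a lambda) as the replacement in re.sub, so that only matches that filled capture group 1 are rewritten and everything else is returned unchanged.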
>>> s = 'AB+B+" HELLO"+TODAY()/C* 100'
>>> rs = re.sub(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b',
... lambda m: 'a2num("' + m.group(1) + '")' if m.group(1) else m.group(), s)
>>> print (rs)
a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100
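The same idea can be written with a named replacement function instead of a lambda, which some readers find easier to follow. This is only a sketch; TOKEN and wrap_names are illustrative names, not part of the original answer.

```python
import re

# Alternatives are tried left to right: quoted strings and the word TODAY are
# consumed first without filling group 1; only bare uppercase runs fill it.
TOKEN = re.compile(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b')

def wrap_names(match):
    """Wrap captured names in a2num("..."); return other matches unchanged."""
    name = match.group(1)
    if name is None:
        return match.group(0)
    return 'a2num("{}")'.format(name)

s = 'AB+B+" HELLO"+TODAY()/C* 100'
result = TOKEN.sub(wrap_names, s)
print(result)
```

The quoted string and TODAY are still matched, which is what stops the third alternative from seeing them; the callback simply puts them back as they were.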

Combining two patterns with named capturing group in Python?

I have a regular expression which uses the before pattern like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:)([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:I118uailfriedx151201005423521">>')
>>> x.group('sid')
'I118uailfriedx151201005423521'
and another like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:<<")([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:<<"I118uailfriedx151201005423521')
>>> x.group('sid')
'I118uailfriedx151201005423521'
How can I combine these two patterns so that parsing these two different lines:
sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>
returns only the corresponding id to me.
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')
Use this; I've just tested it and it works for me. I moved the sid: prefix and the optional <<" out of the named group, so no lookbehind is needed.
Instead of tweaking your regex, you can make your strings easier to parse by just removing any characters except alphanumeric and a colon. Then, just split by colon and get the last item:
>>> import re
>>>
>>> test_strings = ['sid:I118uailfriedx151201005423521">>', 'sid:<<"I118uailfriedx151201005423521']
>>> pattern = re.compile(r"[^A-Za-z0-9:]")
>>> for test_string in test_strings:
...     print(pattern.sub("", test_string).split(":")[-1])
...
I118uailfriedx151201005423521
I118uailfriedx151201005423521
You can achieve what you want with a single regex:
\bsid:\W*(?P<sid>\w+)
The regex breakdown:
\bsid - whole word sid
: - a literal colon
\W* - zero or more non-word characters
(?P<sid>\w+) - one or more word characters captured into a group named "sid"
Python demo:
import re
p = re.compile(r'\bsid:\W*(?P<sid>\w+)')
#test_str = "sid:I118uailfriedx151201005423521\">>" # => I118uailfriedx151201005423521
test_str = "sid:<<\"I118uailfriedx151201005423521" # => I118uailfriedx151201005423521
m = p.search(test_str)
if m:
    print(m.group("sid"))
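As a quick check, the same pattern can be run against both lines quoted in the question. This is a small sketch; the names SID, lines, and ids are chosen here for illustration.

```python
import re

SID = re.compile(r'\bsid:\W*(?P<sid>\w+)')

# The two input lines from the question.
lines = [
    'sid:A111uancalual2626x151130185758596',
    'sid:<<"I118uailfriedx151201005423521">>',
]

# Keep the named group from every line where the pattern is found.
ids = [m.group('sid') for m in map(SID.search, lines) if m]
print(ids)
```

Because \W* also matches zero characters, the plain form and the <<"..." form are handled by the same expression.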

python regular expression grouping

My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the character doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4#9.5' -> ('120x4#9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or one" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)
Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?
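The reason this pattern behaves better than (.*)(?:#(.*))? is that [^#]+ cannot run past a '#', so the greedy first group stops exactly where the optional part can begin. Below is a hedged sketch of how it might be used; split_on_hash is a hypothetical helper name, and '120x4 Words' is an invented input without a '#'.

```python
import re

PATTERN = re.compile(r'([^#]+)(?:#(.*))?')

def split_on_hash(text):
    """Return a tuple of the parts around the first '#', or the whole text."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    # Drop the second group when the optional '#...' part did not take part.
    return tuple(g for g in m.groups() if g is not None)

print(split_on_hash('120x4#Words'))
print(split_on_hash('120x4 Words'))
```

Note that this sketch returns None for an empty string or a string starting with '#', since [^#]+ needs at least one character.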
(.*?)#(.*)|(.+)
This should work. See demo:
http://regex101.com/r/oC3nN4/14
Use re.split:
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4#9.5'
>>> re.split('#',b)
['120x4', '9.5']
>>>
Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
    \A
    (?P<left>.*?)
    (?:
        [#]
        (?P<right>.*)
    )?
    \Z
''', re.VERBOSE)
def parse(text):
    match = REGEX.match(text)
    if match:
        return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
    return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))

Do not match if word appears in regex

I have a url, and I want it to NOT match if the word 'season' is contained in the url. Here are two examples:
CONTAINS SEASON, DO NOT MATCH
'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7'
DOES NOT CONTAIN SEASON, MATCH
'http://imdb.com/title/tt0285331/
Here is what I have so far, but I'm afraid the .+ will match everything until the end. What would be the correct regex to use here?
r'http://imdb.com/title/tt(\d)+/.+^[season].+'
Use a negative lookahead:
urls='''\
http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
http://imdb.com/title/tt0285331/'''
import re
print(re.findall(r'^(?!.*\bseason\b)(.*)', urls, re.M))
# ['http://imdb.com/title/tt0285331/']
You cannot use whole words inside character classes; [season] matches a single character from that set. Use a negative lookahead instead.
>>> s = '''
... http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
... http://imdb.com/title/tt0285331/
... http://imdb.com/title/tt1111111/episodes?this=2
... http://imdb.com/title/tt0123456/episodes?this=1&season=1&ref_=tt_eps_sn_1'''
>>> import re
>>> re.findall(r'\bhttp://imdb.com/title/tt(?!\S+\bseason)\S+', s)
['http://imdb.com/title/tt0285331/', 'http://imdb.com/title/tt1111111/episodes?this=2']
Use a negative lookahead just after tt\d+/:
>>> import re
>>> s = """http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
... http://imdb.com/title/tt0285331/
... """
>>> m = re.findall(r'^http://imdb.com/title/tt\d+/(?:(?!season).)*$', s, re.M)
>>> for i in m:
...     print(i)
...
http://imdb.com/title/tt0285331/
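For checking a single URL rather than scanning a block of text, the lookahead can be combined with re.fullmatch. The following is a minimal sketch, not taken from any of the answers: NO_SEASON and is_non_season_url are made-up names, and the dots in the host name are escaped here so they match literally.

```python
import re

# After the title id and slash, the lookahead rejects the URL if "season"
# appears anywhere in the remainder; otherwise .* consumes the rest.
NO_SEASON = re.compile(r'http://imdb\.com/title/tt\d+/(?!.*season).*')

def is_non_season_url(url):
    """Return True for an IMDb title URL that does not contain 'season'."""
    return NO_SEASON.fullmatch(url) is not None

print(is_non_season_url('http://imdb.com/title/tt0285331/'))
print(is_non_season_url(
    'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7'))
```

A negative lookahead consumes no characters, so it acts purely as a condition on what follows the current position.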

python 3 regular expression match string meta-character

I want to write a regular expression that matches strings like "(2000)", that is, a year in parentheses, so that I can check whether a string contains such a substring.
For example, the regex should match (2000) but not 2000, (20000), or (200).
That is to say: the match must have exactly four digits, the first digit must be 1 or 2, and the parentheses must be included.
Also, 2000 is just an example; I want the regex to cover all possible years.
You have to escape the opening and closing parentheses:
>>> import re
>>> str = """foo(2000)bar(1000)foobar2000"""
>>> regex = r'\(2000\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)']
OR
>>> import re
>>> str = """foo(2000)bar(1000)foobar(2014)barfoo(2020)"""
>>> regex = r'\([0-9]{4}\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)', '(1000)', '(2014)', '(2020)']
This matches every four-digit number (year) enclosed in parentheses.
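The question also asks that the first digit be 1 or 2, which [0-9]{4} does not enforce. One possible tightening is sketched below, assuming the year digits themselves should be captured; YEAR and the sample text are invented for illustration.

```python
import re

# Literal parentheses around exactly four digits, the first being 1 or 2.
# The inner group captures the digits without the parentheses.
YEAR = re.compile(r'\(([12]\d{3})\)')

# Hypothetical sample covering the cases listed in the question.
text = 'foo(2000)bar(1000)x(20000)y(200)z2014(3000)'
years = YEAR.findall(text)
print(years)
```

Because the closing parenthesis must follow the fourth digit immediately, five-digit and three-digit values inside parentheses are rejected without any extra anchoring.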
Special characters need to be escaped with a backslash. A parenthesis ( becomes \(. Therefore (2000) becomes \(2000\).
Then you can do something like:
if re.search(r"\(2000\)", subject):
    # Successful match
    pass
else:
    # Match attempt failed
    pass
>>> import re
>>> x = re.match(r'\((\d*?)\)', "(2000)")
>>> x.group(1)
'2000'
