python regular expression grouping - python

My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the character doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4#9.5' -> ('120x4#9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or none" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)

Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?

(.*?)#(.*)|(.+)
this sjould work.See demo.
http://regex101.com/r/oC3nN4/14

use re.split :
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4#9.5'
>>> re.split('#',b)
['120x4#9.5']
>>>

Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
\A
(?P<left>.*?)
(?:
[#]
(?P<right>.*)
)?
\Z
''', re.VERBOSE)
def parse(text):
match = REGEX.match(text)
if match:
return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Demo
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any string (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
See the Python demo online
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since the replacement pattern contains no backslashes, there is no need preprocessing (escaping) anything in it.
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookahead and lookback assertions, ?<= and ?=. A match only occurs if a string is preceded and followed by the assertions in the pattern, but does not consume them. Meaning that re.sub looks for a string with page1/ in front and _type behind it, but only replaces the part in between.

Replace a character enclosed with lowercase letters

All the examples I've found on stack overflow are too complicated for me to reverse engineer.
Consider this toy example
s = "asdfasd a_b dsfd"
I want s = "asdfasd a'b dsfd"
That is: find two characters separated by an underscore and replace that underscore with an apostrophe
Attempt:
re.sub("[a-z](_)[a-z]","'",s)
# "asdfasd ' dsfd"
I thought the () were supposed to solve this problem?
Even more confusing is the fact that it seems that we successfully found the character we want to replace:
re.findall("[a-z](_)[a-z]",s)
#['_']
why doesn't this get replaced?
Thanks
Use look-ahead and look-behind patterns:
re.sub("(?<=[a-z])_(?=[a-z])","'",s)
Look ahead/behind patterns have zero width and thus do not replace anything.
UPD:
The problem was that re.sub will replace the whole matched expression, including the preceding and the following letter.
re.findall was still matching the whole expression, but it also had a group (the parenthesis inside), which you observed. The whole match was still a_b
lookahead/lookbehind expressions check that the search is preceded/followed by a pattern, but do not include it into the match.
another option was to create several groups, and put those groups into the replacement: re.sub("([a-z])_([a-z])", r"\1'\2", s)
When using re.sub, the text to keep must be captured, the text to remove should not.
Use
re.sub(r"([a-z])_(?=[a-z])",r"\1'",s)
See proof.
EXPLANATION
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
--------------------------------------------------------------------------------
) end of look-ahead
Python code:
import re
s = "asdfasd a_b dsfd"
print(re.sub(r"([a-z])_(?=[a-z])",r"\1'",s))
Output:
asdfasd a'b dsfd
The re.sub will replace everything it matched .
There's a more general way to solve your problem , and you do not need to re-modify your regular expression.
Code below:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
_reg = "|".join(match.groups())
return re.sub(_reg, r,match.group(0)) if _reg else r
#
re.sub(reg,repl, s)
output: 'Data: year=am_replace_str, monthday=am_replace_str, month=am_replace_str, some other text'
Of course, if your case does not contain groups , your code may like this :
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
_reg = "|".join(match.groups())
return re.sub(_reg, r,match.group(0))
#
re.sub(reg,repl, s)

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))
How about this ? Example illusrated using a file:
f = open('abc.log','r')
content = f.readlines()
for line in content:
m = re.search(r"\[(.*?)\]", line)
print m.group(1)
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

regex match special case

I have a cinematic scenario with a bunch of strings like this:
80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla
And my goal is to match all the numbers up to : in selected sequence (selected is 80101_ in this example, strings #2, #3, #5, #6), matching strings without existing numbers (like 80101_:Blablab, string #4) but without matching the string with _intertitle (string #1).
My current regex looks like this (code in Python):
selection = "80101"; # I'm getting this from elsewhere
pattern = selection + "_" + "\d*";
This matches all the strings with/without numbers but also a string with _intertitle. If I modify my pattern like this "\d[^:]*", it doesn't match _intertitle but also doesn't match the string without numbers... I can't get the right pattern, could anyone please lead me in the right direction? Thanks.
I think you should add "(?=:)" in the and of your pattern:
r"80101_\d*(?=:)"
This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.
You could use a negative lookahead:
80101_\d*(?!intertitle)
That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used.
regex101 demo
Your pattern could be written as:
pattern = selection + r"_\d*(?!intertitle)"
You need anchors and multiline flag. Also, you should add the :.* at the end of the regex as well to match the whole string.
^80101_\d*:.*$
See the Demo: https://regex101.com/r/yqGgrv/1
Here is the respective python code as well:
In [1]: s = """80101_intertitle:Blablabla
...: 80101_1:BlablablaBlablabla
...: 80101_2:Blablabla
...: 80101_:BlablablaBlablablaBlablabla
...: 80101_3:BlablablaBlablabla
...: 80101_11:Blablabla
...: 801_1:Blablabla
...: 801_2:Blablabla"""
In [2]: import re
In [4]: re.findall(r'^80101_\d*:.*$', s, re.M)
Out[4]:
['80101_1:BlablablaBlablabla',
'80101_2:Blablabla',
'80101_:BlablablaBlablablaBlablabla',
'80101_3:BlablablaBlablabla',
'80101_11:Blablabla']
Yes, that is easily done:
import re
s = '''80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla'''
matches = re.findall(r'(80101_\d+:.*)', s)
for match in matches:
print(match)
matches = re.findall(r'(80101_:.*)', s)
for match in matches:
print(match)

Python Regular Expression Matching: ## ##

I'm searching a file line by line for the occurrence of ##random_string##. It works except for the case of multiple #...
pattern='##(.*?)##'
prog=re.compile(pattern)
string='lala ###hey## there'
result=prog.search(string)
print re.sub(result.group(1), 'FOUND', string)
Desired Output:
"lala #FOUND there"
Instead I get the following because its grabbing the whole ###hey##:
"lala FOUND there"
So how would I ignore any number of # at the beginning or end, and only capture "##string##".
To match at least two hashes at either end:
pattern='##+(.*?)##+'
Your problem is with your inner match. You use ., which matches any character that isn't a line end, and that means it matches # as well. So when it gets ###hey##, it matches (.*?) to #hey.
The easy solution is to exclude the # character from the matchable set:
prog = re.compile(r'##([^#]*)##')
Protip: Use raw strings (e.g. r'') for regular expressions so you don't have to go crazy with backslash escapes.
Trying to allow # inside the hashes will make things much more complicated.
EDIT: If you do not want to allow blank inner text (i.e. "####" shouldn't match with an inner text of ""), then change it to:
prog = re.compile(r'##([^#]+)##')
+ means "one or more."
'^#{2,}([^#]*)#{2,}' -- any number of # >= 2 on either end
be careful with using lazy quantifiers like (.*?) because it'd match '##abc#####' and capture 'abc###'. also lazy quantifiers are very slow
Try the "block comment trick": /##((?:[^#]|#[^#])+?)##/
Adding + to regex, which means to match one or more character.
pattern='#+(.*?)#+'
prog=re.compile(pattern)
string='###HEY##'
result=prog.search(string)
print result.group(1)
Output:
HEY
have you considered doing it non-regex way?
>>> string='lala ####hey## there'
>>> string.split("####")[1].split("#")[0]
'hey'
>>> import re
>>> text= 'lala ###hey## there'
>>> matcher= re.compile(r"##[^#]+##")
>>> print matcher.sub("FOUND", text)
lala #FOUND there
>>>

Categories

Resources