regex match special case

regex match special case - python

I have a cinematic scenario with a bunch of strings like this:
80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla
And my goal is to match all the numbers up to : in selected sequence (selected is 80101_ in this example, strings #2, #3, #5, #6), matching strings without existing numbers (like 80101_:Blablab, string #4) but without matching the string with _intertitle (string #1).
My current regex looks like this (code in Python):
selection = "80101"; # I'm getting this from elsewhere
pattern = selection + "_" + "\d*";
This matches all the strings with/without numbers but also a string with _intertitle. If I modify my pattern like this "\d[^:]*", it doesn't match _intertitle but also doesn't match the string without numbers... I can't get the right pattern, could anyone please lead me in the right direction? Thanks.

I think you should add "(?=:)" in the and of your pattern:
r"80101_\d*(?=:)"
This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.

You could use a negative lookahead:
80101_\d*(?!intertitle)
That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used.
regex101 demo
Your pattern could be written as:
pattern = selection + r"_\d*(?!intertitle)"

You need anchors and multiline flag. Also, you should add the :.* at the end of the regex as well to match the whole string.
^80101_\d*:.*$
See the Demo: https://regex101.com/r/yqGgrv/1
Here is the respective python code as well:
In [1]: s = """80101_intertitle:Blablabla
...: 80101_1:BlablablaBlablabla
...: 80101_2:Blablabla
...: 80101_:BlablablaBlablablaBlablabla
...: 80101_3:BlablablaBlablabla
...: 80101_11:Blablabla
...: 801_1:Blablabla
...: 801_2:Blablabla"""
In [2]: import re
In [4]: re.findall(r'^80101_\d*:.*$', s, re.M)
Out[4]:
['80101_1:BlablablaBlablabla',
'80101_2:Blablabla',
'80101_:BlablablaBlablablaBlablabla',
'80101_3:BlablablaBlablabla',
'80101_11:Blablabla']

Yes, that is easily done:
import re
s = '''80101_intertitle:Blablabla
80101_1:BlablablaBlablabla
80101_2:Blablabla
80101_:BlablablaBlablablaBlablabla
80101_3:BlablablaBlablabla
80101_11:Blablabla
801_1:Blablabla
801_2:Blablabla'''
matches = re.findall(r'(80101_\d+:.*)', s)
for match in matches:
print(match)
matches = re.findall(r'(80101_:.*)', s)
for match in matches:
print(match)

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.

You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Demo
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any string (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position

You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
See the Python demo online
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since the replacement pattern contains no backslashes, there is no need preprocessing (escaping) anything in it.

This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookahead and lookback assertions, ?<= and ?=. A match only occurs if a string is preceded and followed by the assertions in the pattern, but does not consume them. Meaning that re.sub looks for a string with page1/ in front and _type behind it, but only replaces the part in between.

in regex how to match multiple Or conditions but exclude one condition

If I need to match a string "a" with any combination of symbols ##$ before and after it, such as #a#, #a#, $a$ etc, but not a specific pattern #a$. How can I exclude this? Suppose there're too many combinations to manually spell out one-by-one. And it's not negative lookahead or behind cases as seen in other SO answers.
import re
pattern = "[#|#|&]a[#|#|&]"
string = "something#a&others"
re.findall(pattern, string)
Currently the pattern returns results like '#a&' as expected, but also wrongly return on the string to be excluded. The correct pattern should return [] on re.findall(pattern,'#a$')

You can use the character class to list all the possible characters, and use a single negative lookbehind after the match to assert not #a$ directly to the left.
Note that you don't need the | in the character class, as it would match a pipe char and is the same as [#|#&]
[##&$]a[##&$](?<!#a\$)
Regex demo | Python demo
import re
pattern = r"[##&$]a[##&$](?<!#a\$)"
print(re.findall(pattern,'something#a&others#a$'))
Output
['#a&']

I was going to suggest a fairly ugly and complex regex pattern with lookarounds. But instead, you could just proceed with your current pattern and then use a list comprehension to remove the false positive case:
inp = "something#a&others #a$"
matches = re.findall(r'[##&$]+a[##&$]+', inp)
matches = [x for x in matches if x != '#a$']
print(matches) # ['#a&']

Insert space if uppercase letter is preceded and followed by one lowercase letter - Python

Is there a way to insert aspace if it contains a uppercase letter (but not the first letter)?
For example, given "RegularExpression" I´d like to obtain "Regular Expression".
I tried the following regex:
re.sub("[a-z]{1}[A-Z][a-z]{1}", " ","regularExpression")
Unfortunately, this deletes the matching pattern:
regula pression
I would prefer a regex solution, yet would be thankful for any working solution.
Thanks!

In [1]: s = 'RegularExpression'
In [2]: answer = []
In [3]: breaks = [i for i,char in enumerate(s) if char.isupper()]
In [4]: breaks = breaks[1:]
In [5]: answer.append(s[:breaks[0]])
In [6]: for start,end in zip(breaks, breaks[1:]):
...: answer.append(s[start:end])
...:
In [7]: answer.append(s[breaks[-1]:])
In [8]: answer
Out[8]: ['Regular', 'Expression']
In [9]: print(' '.join(answer))
Regular Expression

You can do this with the following:
import re
s = "RegularExpression"
re.sub(r"([A-Z][a-z]+)([A-Z][a-z]+)", r"\1 \2", s)
which means "put a space between the first match group and the second match group", where the match groups are a cap followed by one or more non-caps.

Try using Lookbehind "(?<=[a-z])([A-Z])"
Ex:
import re
s = "RegularExpression"
print(re.sub(r"(?<=[a-z])([A-Z])", r" \1", s))
Output:
Regular Expression

As I understand, when an uppercase letter is preceded by a lowercase letter you wish to insert a space between them. You can do that by using re.sub to replace (zero-width) matches of the following regular expression with a space.
r'(?<=[a-z])(?=[A-Z])'
Regex demo <¯\(ツ)/¯> Python code
Note that the SUBSTITUTION box at the regex demo link contains one space.
Python's regex engine performs the following operations.
(?<=[a-z]) : use a positive lookbehind to assert that the match is preceded
by a lowercase letter
(?=[A-Z]) : use a positive lookahead to assert that the match is followed
by an uppercase letter
For the string 'RegularExpression' the regex matches the location between the letters 'r' and 'E' (i.e., a zero-width match).

IIUC, one way using re.findall:
re.findall("[A-Z][a-z]+", "RegularExpression")
Output:
['Regular', 'Expression']

Replace a character enclosed with lowercase letters

All the examples I've found on stack overflow are too complicated for me to reverse engineer.
Consider this toy example
s = "asdfasd a_b dsfd"
I want s = "asdfasd a'b dsfd"
That is: find two characters separated by an underscore and replace that underscore with an apostrophe
Attempt:
re.sub("[a-z](_)[a-z]","'",s)
# "asdfasd ' dsfd"
I thought the () were supposed to solve this problem?
Even more confusing is the fact that it seems that we successfully found the character we want to replace:
re.findall("[a-z](_)[a-z]",s)
#['_']
why doesn't this get replaced?
Thanks

Use look-ahead and look-behind patterns:
re.sub("(?<=[a-z])_(?=[a-z])","'",s)
Look ahead/behind patterns have zero width and thus do not replace anything.
UPD:
The problem was that re.sub will replace the whole matched expression, including the preceding and the following letter.
re.findall was still matching the whole expression, but it also had a group (the parenthesis inside), which you observed. The whole match was still a_b
lookahead/lookbehind expressions check that the search is preceded/followed by a pattern, but do not include it into the match.
another option was to create several groups, and put those groups into the replacement: re.sub("([a-z])_([a-z])", r"\1'\2", s)

When using re.sub, the text to keep must be captured, the text to remove should not.
Use
re.sub(r"([a-z])_(?=[a-z])",r"\1'",s)
See proof.
EXPLANATION
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
_ '_'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
--------------------------------------------------------------------------------
) end of look-ahead
Python code:
import re
s = "asdfasd a_b dsfd"
print(re.sub(r"([a-z])_(?=[a-z])",r"\1'",s))
Output:
asdfasd a'b dsfd

The re.sub will replace everything it matched .
There's a more general way to solve your problem , and you do not need to re-modify your regular expression.
Code below:
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
_reg = "|".join(match.groups())
return re.sub(_reg, r,match.group(0)) if _reg else r
#
re.sub(reg,repl, s)
output: 'Data: year=am_replace_str, monthday=am_replace_str, month=am_replace_str, some other text'
Of course, if your case does not contain groups , your code may like this :
import re
s = 'Data: year=2018, monthday=1, month=5, some other text'
reg = r"year=(\d{4}), monthday=(\d{1}), month=(\d{1})"
r = "am_replace_str"
def repl(match):
_reg = "|".join(match.groups())
return re.sub(_reg, r,match.group(0))
#
re.sub(reg,repl, s)

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.

In your regex
([\"'])[^\1]*\1
Character class is meant for matching only one character. So your use of [^\1] is incorrect. Think, what would have have happened if there were more than one characters in the first capturing group.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not matches in string bb\"b\"ccc"

You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'

You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regex match special case - python

I think you should add "(?=:)" in the and of your pattern: r"80101_\d*(?=:)" This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.

You could use a negative lookahead: 80101_\d(?!intertitle) That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used. regex101 demo Your pattern could be written as: pattern = selection + r"_\d(?!intertitle)"

Related

Replace a substring between two substrings

in regex how to match multiple Or conditions but exclude one condition

Insert space if uppercase letter is preceded and followed by one lowercase letter - Python

Replace a character enclosed with lowercase letters

Use python 3 regex to match a string in double quotes

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

regex match special case - python

I think you should add "(?=:)" in the and of your pattern: r"80101_\d*(?=:)" This means: select "80101_" + zero or more digits only if it’s followed by ":". In case of "80101_intertitle:Blablabla" we have a non-digit symbol between "80101_" and ":", so it doesn't match.

You could use a negative lookahead: 80101_\d*(?!intertitle) That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used. regex101 demo Your pattern could be written as: pattern = selection + r"_\d*(?!intertitle)"

Related

Replace a substring between two substrings

in regex how to match multiple Or conditions but exclude one condition

Insert space if uppercase letter is preceded and followed by one lowercase letter - Python

Replace a character enclosed with lowercase letters

Use python 3 regex to match a string in double quotes

Categories

Resources

You could use a negative lookahead: 80101_\d(?!intertitle) That negative lookahead (?! ... ) prevents a match if its contents are present at the point it is used. regex101 demo Your pattern could be written as: pattern = selection + r"_\d(?!intertitle)"