python regex - removing 'word' while keeping ['word']

I'd like to remove Opportunity while keeping '[Opportunity]'.
Winery Tailspin Electonic Opportunity [Opportunity].[Opportunity Name]
How do I do that?

If the bare word always occurs before your '[word]', you can use the count parameter of re.sub, as below.
re.sub('Opportunity','',string,count = 1)

I am not sure what you want to do with the [Opportunity Name] bit, but the following line removes every Opportunity that is neither preceded by [ nor followed by ]:
re.sub(r'([^\[])(Opportunity)([^\]])', r'\g<1>\g<3>', string)
This code uses grouping and matches strings of the form
(any character other than [)(Opportunity)(any character other than ])
then replaces each match with the first and third groups, i.e. the adjacent characters.
Applied to your example, this gives
Winery Tailspin Electonic [Opportunity].[Opportunity Name]
Notice, however, that this solution works only if Opportunity is neither the first nor the last word, because the pattern needs one character on each side of it. Is this true in your case?
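If the word can also sit at the very start or end of the string, one possible variation is to swap the consuming groups for lookarounds, which only inspect the neighbouring character and do not need one to exist. This is just a sketch: the helper name remove_bare_word and the second input string are made up for illustration.

```python
import re

def remove_bare_word(text, word="Opportunity"):
    # (?<!\[) : the word must not be preceded by "["
    # (?!\])  : the word must not be followed by "]"
    pattern = r"(?<!\[)" + re.escape(word) + r"(?!\])"
    return re.sub(pattern, "", text)

sample = "Winery Tailspin Electonic Opportunity [Opportunity].[Opportunity Name]"
cleaned = remove_bare_word(sample)

# Hypothetical input where the bare word is the first word of the string.
cleaned_at_start = remove_bare_word("Opportunity [Opportunity]")

print(repr(cleaned))
print(repr(cleaned_at_start))
```

Only the word itself is removed, so the spaces on either side of it stay in the result.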

Related

Regex - Group everything until last occurrence

When working on this string:
see.Ya23.v2.0023.jpg
I already found out I could get the last occurrence of a number by using:
(?P<Frame>\d+(?!.*\d))
It gives me the group containing "0023".
But how do I group everything until that happens?
If I do this:
(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))
My two groups contain "see.Ya23.v2.002" and "3", when I would like them to contain "see.Ya23.v2." and "0023".
Hope you can help me. Thanks in advance.
You almost had it.
In the first group, just add the lazy indicator ? after .* so that the group stops at the first position where the rest of the pattern can match.
(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))
This will give you
see.Ya23.v2. and 0023
If you also want to avoid selecting the dot:
(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))
The result is see.Ya23.v2 and 0023
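To see the difference for yourself, the greedy pattern from the question and the two lazy patterns can be run side by side. A minimal sketch using re.match on the file name from the question; the variable names are only for illustration.

```python
import re

name = "see.Ya23.v2.0023.jpg"

# Greedy .* grabs as much as it can and leaves a single digit for Frame.
greedy = re.match(r"(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))", name)

# Lazy .*? grows one character at a time until the rest of the pattern fits.
lazy = re.match(r"(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))", name)

# Same idea, with the separating dot matched outside both groups.
lazy_no_dot = re.match(r"(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))", name)

for label, m in (("greedy", greedy), ("lazy", lazy), ("lazy_no_dot", lazy_no_dot)):
    print(label, m.group("Sequence"), m.group("Frame"))
```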
The simplest and quickest way is to put a negative lookbehind for a digit, (?<!\d), before your digit expression at the start of the Frame group.
This makes sure that Frame is the last complete run of digits while still allowing a greedy Sequence match, which gives a performance boost.
(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))
https://regex101.com/r/LCUoCR/1
The problem is explained in my Youtube video related to how backtracking works in regex.
In short: the .* part matches the whole string first, and then the regex engine starts stepping back through the string to accommodate a part for the subsequent patterns, i.e. for \d+(?!.*\d). Once the 3 is found in see.Ya23.v2.0023.jpg, this pattern matches, and the regex engine returns a match.
All you need is to make sure the character before \d+ is a non-digit, so use
(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)
See the regex demo.
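The two greedy variants (the lookbehind one and the .*\D one) can be compared in the same way. This is a small sketch; the second file name, 0023.jpg, is an assumed edge case that does not appear in the question, added to show that both patterns also cope with a name that starts with the frame number.

```python
import re

patterns = {
    # Frame may only start where the previous character is not a digit.
    "lookbehind": r"(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))",
    # Sequence, if present, must end with a non-digit character.
    "non_digit": r"(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)",
}

def split_name(pattern, name):
    m = re.match(pattern, name)
    return (m.group("Sequence"), m.group("Frame")) if m else None

results = {
    label: [split_name(p, n) for n in ("see.Ya23.v2.0023.jpg", "0023.jpg")]
    for label, p in patterns.items()
}

for label, pairs in results.items():
    print(label, pairs)
```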

Regex (python) exclude some part of the replacement

For sure there are other ways to solve this, but I'm interested if this can be solved exclusively via regex. I have lines of text like this:
9,A
11,B
22,>
33,B
72,A
91,<
112,A
162,B
When I try to apply this replacement to basically "join" or erase the part between arrows and replace them with "+++":
re.sub(r'\>(\n\d.+)+<','+++',string_above)
I get this, which is fine:
9,A
11,B
22,+++
112,A
162,B
But what if I want to keep the last number before the "<" sign and put, say, an "X" after it, to get something like this:
9,A
11,B
22,+++
91,X
112,A
162,B
How can I do that?
In this concrete case, you may replace with
r'+++\1X'
See the regex demo
If X is a digit, \1 followed by that digit would be read as a different group number, so replace with
r'+++\g<1>X'
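\1 and \g<1> are called replacement backreferences; both refer to the value captured by group #1. Because that group is repeated, it holds only its last iteration, here the newline followed by 91 and the comma.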
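Here is a short sketch of the replacements discussed in this answer, run against the sample text from the question. The digit 7 in the last call is just an arbitrary stand-in for "X is a digit".

```python
import re

text = "9,A\n11,B\n22,>\n33,B\n72,A\n91,<\n112,A\n162,B"
pattern = r">(\n\d.+)+<"

# Original replacement: everything between the arrows disappears.
collapsed = re.sub(pattern, "+++", text)

# Group 1 holds only its last repetition ("\n91,"), so it can be put back.
kept = re.sub(pattern, r"+++\1X", text)

# With a digit right after the backreference, \g<1> keeps it unambiguous.
kept_digit = re.sub(pattern, r"+++\g<1>7", text)

print(collapsed)
print(kept)
print(kept_digit)
```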

Python re module groups match mechanism

Question Formation
background
As I was reading through the tutorial in the Python 2.7 re documentation, it introduced the behavior of groups:
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I clearly understand how this works for a single group, but I can't understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, doesn't + simply mean one or more? If so, shouldn't the regex ([abc])+ be equivalent to ([abc])([abc])+ (not formal BNF)? Then the result should be:
('a','b','c')
Please shed some light about the mechanism behind, thanks.
P.S
I want to learn how a regex interpreter works. Where should I start: books, or a particular regex version? Thanks!
Well, I guess a picture is worth a 1000 words:
link to the demo
What's happening, as you can see in the visual representation of the automaton, is that your regexp applies a one-character group one or more times until it reaches the end of the match. Only that last character ends up in the group.
If you want to get the output you say, you need to do something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
As for documentation, I advise you to first read up on the theory of NFAs and regexps. The MIT material on the topic is pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/
Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason why it isn't returning the results you'd expect is because you're using the repeat operator + outside of the group. This ends up causing the group to equal only the last match, and thus only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)
One pair of grouping parentheses forms one group, even if it's inside a quantifier. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
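The behaviour described in these answers is easy to check in a few lines. The re.findall call at the end is not from the answers above; it is one possible way to get every character separately, added here as an assumption about what the asker ultimately wants.

```python
import re

# Quantifier outside the group: one group, which keeps only its last match.
last_only = re.match(r"([abc])+", "abc").groups()

# Quantifier inside the group: one group holding the whole run.
whole_run = re.match(r"([abc]+)", "abc").groups()

# Alternative (assumption): collect every single-character match as a list.
each_char = re.findall(r"[abc]", "abc")

print(last_only)   # ('c',)
print(whole_run)   # ('abc',)
print(each_char)   # ['a', 'b', 'c']
```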

Python Regex instantly replace groups

Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
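A small sketch of why \g<1> is the safer form. The pattern (\d+)\s(\d+) and the input "12 34" are hypothetical, chosen only so that a literal 0 has to follow each group reference.

```python
import re

pattern = r"(\d+)\s(\d+)"

# \g<2>0 means "group 2, then a literal 0"; likewise for group 1.
swapped = re.sub(pattern, r"\g<2>0 \g<1>0", "12 34")

# \20 and \10 are read as references to groups 20 and 10, which do not exist.
try:
    re.sub(pattern, r"\20 \10", "12 34")
    numeric_form_failed = False
except re.error:
    numeric_form_failed = True

print(swapped)              # 340 120
print(numeric_form_failed)  # True
```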

re.sub not replacing all occurrences

I'm not a Python developer, but I'm using a Python script to convert SQLite to MySQL
The suggested script gets close, but no cigar, as they say.
The line giving me a problem is:
line = re.sub(r"([^'])'t'(.)", r"\1THIS_IS_TRUE\2", line)
...along with the equivalent line for false ('f'), of course.
The problem I'm seeing is that only the first occurrence of 't' in any given line is replaced.
So, input to the script,
INSERT INTO "cars" VALUES(56,'Bugatti Veyron','BUG 1',32,'t','t','2011-12-14 18:39:16.556916','2011-12-15 11:25:03.675058','81');
...gives...
INSERT INTO "cars" VALUES(56,'Bugatti Veyron','BUG 1',32,THIS_IS_TRUE,'t','2011-12-14 18:39:16.556916','2011-12-15 11:25:03.675058','81');
I mentioned I'm not a Python developer, but I have tried to fix this myself. According to the documentation, I understand that re.sub should replace all occurrences of 't'.
I'd appreciate a hint as to why I'm only seeing the first occurrence replaced, thanks.
The two substitutions you'd want in your example overlap - the comma between your two instances of 't' will be matched by (.) in the first case, so ([^']) in the second case never gets a chance to match it. This slightly modified version might help:
line = re.sub(r"(?<!')'t'(?=.)", r"THIS_IS_TRUE", line)
This version uses lookahead and lookbehind syntax, described here.
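The overlap is easy to reproduce. In this sketch the input is a shortened, made-up version of the INSERT line from the question, just long enough to contain two adjacent 't' values.

```python
import re

line = "VALUES(32,'t','t','2011');"

# Consuming version: the comma after the first 't' is swallowed by (.),
# so the second 't' has no non-quote character left in front of it.
consuming = re.sub(r"([^'])'t'(.)", r"\1THIS_IS_TRUE\2", line)

# Lookaround version: neighbours are only inspected, never consumed.
lookaround = re.sub(r"(?<!')'t'(?=.)", "THIS_IS_TRUE", line)

print(consuming)
print(lookaround)
```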
How about
line = line.replace("'t'", "THIS_IS_TRUE").replace("'f'", "THIS_IS_FALSE")
without using re. This replaces all occurrences of 't' and 'f'. Just make sure that no car is named t.
The first match you see is ,'t',. Python then continues with the next character, which is ' (before the second t). From there it cannot match the ([^']) part, so it skips the second 't'.
In other words, subsequent matches to be replaced cannot overlap.
Using re.sub(r"\bt\b","THIS_IS_TRUE",line) also replaces every occurrence, although the surrounding quotes stay in place:
In [21]: strs="""INSERT INTO "cars" VALUES(56,'Bugatti Veyron','BUG 1',32,'t','t','2011-12-14 18:39:16.556916','2011-12-15 11:25:03.675058','81');"""
In [22]: print(re.sub(r"\bt\b","THIS_IS_TRUE",strs))
INSERT INTO "cars" VALUES(56,'Bugatti Veyron','BUG 1',32,'THIS_IS_TRUE','THIS_IS_TRUE','2011-12-14 18:39:16.556916','2011-12-15 11:25:03.675058','81');
