regex in python, remove pattern '[.../...]' from the string in python

regex in python, remove pattern '[.../...]' from the string in python - python

I have an input string for e.g:
input_str = 'this is a test for [blah] and [blah/blahhhh]'
and I want to retain [blah] but want to remove [blah/blahhhh] from the above string.
I tried the following codes:
>>>re.sub(r'\[.*?\]', '', input_str)
'this is a test for and '
and
>>>re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '
what should be the right regex pattern to get the output as "this is a test for [blah] and"?

I don't understand why your 2nd regex doesn't work, I tested it yes, you are correct, it doesn't work. So you can use the same idea but with different approaches.
Instead of using the wildcards you can use the \w like this:
\[\w+\/\w+\]
Working demo
By the way, if you can have non characters separated by /, then you can use this regex:
\[[^\]]*\/[^\]]*]
Working demo

The reason the second regex in the original post matches more than the OP wants is that . matches any character including ]. So \[.*?\/' (or just \[.*?/ since the \ before the / is superfluous) will match more than it seems the OP wanted: [blah] and [blah/ in input_str.
The ? adds confusion. It will limit repetition of the .* part of .*\] sub-expression, but you have to understand what repetition you're limiting [1]. It's better to explicitly match any non-closing bracket instead of the . wildcard to begin with. So-called "greedy" matching of .* is often a stumbling block since it will match zero or more occurrences of any character until that wildcard match fails (usually much longer than people expect). In your case it greedily matches as much of the input as possible until the last occurrence of the next explicitly specified part of the regex (] or / in your regexes). Instead of using ? to try to counteract or limit greedy matching with lazy matching, it is often better to be explicit about what to not match in the greedy part.
As an illustration, see the following example of .* grabbing everything until the last occurrence of the character after .*:
echo '////k////,/k' | sed -r 's|/.*/|XXX|'
XXXk
echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
XXXk
And subtleties of greedy / lazy matching behavior can vary from one regex implementation to the next (pcre, python, grep/egrep). For portability and simplicity / clarity, be explicit when you can.
If you only want to look for strings with brackets that don't include a closing bracket character before the slash character, you could more explicitly look for "not-a-closing-bracket" instead of the wildcard match:
re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '
This uses a character class expression - [^]] - instead of the wildcard . to match any character that is explicitly not a closing bracket.
If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.
[1] http://www.regular-expressions.info/repeat.html

You can write a function that takes that input_str as an argument and loop trough the string and if it sees '/' between '[' and ']' jumps back to the position where '[' is and removes all elements including ']'

Related

How to understand snippet of Regex

I am attempting to understand what this snippet of code does:
passwd1=re.sub(r'^.*? --', ' -- ', line)
password=passwd1[4:]
I understand that the top line uses regex to remove the " -- ", and the bottom line I think removes something as well? I went back to this code after a while and need to improve it but to do that I need to understand this again. I've been trying to read regex docs to no avail, what is this: r'^.*? at the beginning of the regex?.

To break r'^.*? -- into pieces:
r in front of a string in Python lets the interpreter know that it's a regex string. This lets you not have to do a bunch of confusing character escaping.
The ^ tells the regex to match only from the beginning of the string.
.*? tells the regex to match any number of characters up to...
--, which is a literal match.
The sum of this is that it will match any string, starting at the beginning of a line up to the -- demarcation. Since it is re.sub(), the matched part of the string will be replaced with --.
This is why something like Google -- MyPassword becomes -- MyPassword.
The second line is a simple string slice, dropping the first four elements (characters) of the string. This might be superfluous - you could just substitute the match with an empty string like this:
passwd1 = re.sub(r'^.* --', '', line)
This achieves the same result. Note I've dropped the ?, which is also superfluous here, because the * has a similar but broader effect. There are some technical differences, but I don't think you need it for your stated purpose.
? will match zero or one of the previous character - in this case a ., which is 'any character'. The * will match zero or more of the previous character. .* is what is known as a greedy quantifier, and .*? a lazy quantifier. That is, the greedy quantifier will match as much as possible, and the lazy will match as little as possible. The difference between ^.*? -- and ^.* -- is what is matched in this case:
Something something -- mypassword -- yourpassword
In the greedy case, the first two clauses ('something something -- mypassword') are matched and deleted. In the lazy case, only 'something something' is deleted. Most passwords don't include spaces, nevermind ' -- ', so you probably want to use the greedy version.

You can use a site like regex101 to input your regular expression and get some analysis of it. It will tell you whether your regular expression matches some test cases, and also explain what each character in the regular expression means. In this case it matches everything up to and including the first instance of ' -- ' in your string, and replaces it with just the characters ' -- '.
The second line is slicing the string. It takes a substring, skipping over the first four characters and then continuing to the end of the string.
Effectively, given a string which has ' -- ' somewhere in it, this pair of lines will take everything after that substring. However, if that substring is not found in line then instead you will simply be discarding the first four characters. If line has less than four characters you will get an error.

r means Regex
^ means Starts with
. means Any character (except newline character)
* means Zero or more occurrences
? means Zero or one occurrences
In other words, it means that it matches a string that begins with the exact number of characters --. sub replaces that part of the string that matches with ' -- '.
The second command just sets a variable ignoring the first 4 characters of the string which is the newly set ' -- '.

Prevent last duplicate character from string [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Python regex: how to achieve this complex replacement rule?

I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it has not to be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is r''. I understood why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.search doesn't understand that it has to replace the single full stop . rather than understanding '.' as the usual regex.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?

There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, use re.sub on the same regular expression instead of re.search, not as well.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets -- this is a context where dot and colon do not have any special meaning1), repeated as many times as possible; and lookarounds on both sides to say we cannot match these characters when there is whitespace \s on either side, and also excluding the characters themselves so that there is no way for the regex engine to find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
1 You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!

Regular expression pattern questions?

I am having a hard time understanding regular expression pattern. Could someone help me regular expression pattern to match all words ending in s. And start with a and end with a (like ana).
How do I write ending?

Word boundaries are given by \b so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b" where as \b is a word boundary, \w+ is one or more "word character" and (?:ing|s) is an uncaptured group of either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example there is a regex that validates email addresses - but its computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b which matches a word. We still need to "formalize" ... which we want to have the meaning of "one ore more characters", which directly translates to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how is the following: \b\w+ing|s\b? If you test this, you'll see that it will match confusing things like ingest which should not match our regex. What is happening? As you probably already saw the | can't know "which part it should or", so we need to introduce parenthesis: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
Why (and how) does this differ from the example I gave first? First I wrote \w+? instead of \w+, the ? turns the + into a non-greedy version. If you know what the difference between greedy and non greedy is, skip this paragraph. Consider the following: AaAAbA and we want to match the things enclosed with big letter A. A naive try: A\w+A, so one or more word characters enclosed with A. This matches AaA, but also AaAAbA, A is still something that can be matched by \w. Without further config the *+? quantifier all try to match as much as possible. Sometimes, like in the A example, you don't want that, you can then use a ? after the quantifier to signal you want a non-greedy version, a version that matches as little as possible.
But in our case this isn't needed, the words are well seperated by whitespaces, which are not part of \w. So in fact you can just let + be greedy and everything will be alright. If you use . (any character) you often need to be careful not to match to much.
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall(\b\w+(?:ing|s)\b, "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why, maybe someone else can explain that. Regex are an arcane thing, only use them for easy and easy to test tasks.

Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.

>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam. We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(.
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to ignore the \'s.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.

It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile.view\.aspx", s);
print mo.group(1)
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.

If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parenthesis is quite a bit of a pain in regex. if that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find a open paren, match characters until you find another open paren, group characters until I see a close paren, then make sure there are two more close paren somewhere in there.

mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.