I am attempting to understand what this snippet of code does:
passwd1=re.sub(r'^.*? --', ' -- ', line)
password=passwd1[4:]
I understand that the top line uses regex to strip everything before the " -- ", and I think the bottom line removes something as well? I came back to this code after a while and need to improve it, but to do that I need to understand it again. I've been reading the regex docs to no avail. What is this r'^.*? at the beginning of the regex?
To break r'^.*? --' into pieces:
r in front of a string literal in Python makes it a raw string, in which backslashes are not treated as escape characters. This saves you from a bunch of confusing double-escaping when writing regexes.
The ^ tells the regex to match only from the beginning of the string.
.*? tells the regex to match any number of characters up to...
--, which is a literal match.
The sum of this is that it will match any string, starting at the beginning of a line up to the -- demarcation. Since it is re.sub(), the matched part of the string will be replaced with --.
This is why something like Google -- MyPassword becomes -- MyPassword.
The second line is a simple string slice, dropping the first four elements (characters) of the string. This might be superfluous - you could just substitute the match with an empty string like this:
passwd1 = re.sub(r'^.* --', '', line)
This achieves the same result in the single-password case. Note I've also dropped the ?, which changes the match from lazy to greedy. There are some technical differences (explained below), but I don't think you need the lazy version for your stated purpose.
On its own, ? matches zero or one of the preceding element, but when it follows a quantifier like * it changes that quantifier's behavior instead. .* is what is known as a greedy quantifier, and .*? a lazy quantifier: the greedy quantifier will match as much as possible, and the lazy one will match as little as possible. The difference between ^.*? -- and ^.* -- is what is matched in this case:
Something something -- mypassword -- yourpassword
In the greedy case, everything through the second ' --' ('Something something -- mypassword --') is matched and deleted. In the lazy case, only 'Something something --' is deleted. Most passwords don't include spaces, never mind ' -- ', so you probably want to use the greedy version.
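A quick comparison of the two on that string:

```python
import re

s = "Something something -- mypassword -- yourpassword"

# Greedy: ^.* -- consumes up to the LAST " --"
print(re.sub(r'^.* --', '', s))   # " yourpassword"

# Lazy: ^.*? -- stops at the FIRST " --"
print(re.sub(r'^.*? --', '', s))  # " mypassword -- yourpassword"
```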
You can use a site like regex101 to input your regular expression and get some analysis of it. It will tell you whether your regular expression matches some test cases, and also explain what each character in the regular expression means. In this case it matches everything up to and including the first instance of ' -- ' in your string, and replaces it with just the characters ' -- '.
The second line is slicing the string. It takes a substring, skipping over the first four characters and then continuing to the end of the string.
Effectively, given a string which has ' -- ' somewhere in it, this pair of lines will take everything after that substring. However, if that substring is not found in line, then instead you will simply be discarding the first four characters. (If line has fewer than four characters, the slice just returns an empty string - Python slicing does not raise an error for out-of-range bounds.)
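A quick demonstration of the slice behavior:

```python
line = "no delimiter here"
print(line[4:])    # "elimiter here" - the first four characters are silently dropped

short = "ab"
print(short[4:])   # "" - an out-of-range slice returns an empty string, not an error
```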
r means a raw string (backslashes are not treated as escape characters)
^ means starts with
. means any character (except the newline character)
* means zero or more occurrences
? after * makes the match lazy (as few characters as possible)
In other words, the pattern matches everything from the start of the string up to the first occurrence of ' --'. sub replaces that matched part of the string with ' -- '.
The second command just sets a variable ignoring the first 4 characters of the string, which is the newly substituted ' -- '.
I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re
string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here
I then noticed that if the string starts or ends with delimiters (i.e. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") then the tokenizer will print empty strings (these are actually extra matches to string start and string end) at the beginning and end of its list of token outputs. To remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work, but it doesn't (and it also seems to behave wildly differently on other regex engines). For some reason it wouldn't build a single match from the string's start and the delimiters following it; the string start somehow also consumes the character following it! (This is also where I see divergence from other engines - is this a BUG, or does it have something to do with how zero-width assertions interact with the or (|) operator in Python that I'm not aware of?) This attempt also did nothing for the double match containing the string's end: once it matched the delimiters, and then it gave another match for the string end ($) itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems: the split at the beginning doesn't contain the string start (but I don't care too much about that anyway, since I'm interested in the tokens themselves), and it also matches the string start when there are no delimiters at the beginning of the string. The string ending is still a problem, though.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't really much of a problem in the first place), but try as I might I couldn't get rid of the delimiters-plus-end match followed by a separate end match (which yields an additional empty string). Still, I'm showing this example (with grouping) since it shows that the ending special character $ is matched twice: once together with the preceding delimiters and once by itself (two matches for group 2).
My questions are:
Why do I get such a strange behavior in attempt #1
How do I solve the end of string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string and must produce an iterable generator which iterates on the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily - otherwise I'd have a simple solution - but if you must know, it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline which then constructs yielded sentences out of it in various patterns that are used to extract fields from semi-structured, classifier-driven messages.)
The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]'  # newline, hyphen, or space
search_regex = r'''^(?!{0})    # string start with no delimiter
                 |             # or
                 {0}+          # sequence of delimiters (at least one)
                 |             # or
                 (?<!{0})$     # string end with no delimiter
                 '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
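Here is the pattern applied to the question's delimiter-padded sample, with the pairwise recipe reimplemented via itertools.tee so the snippet needs no third-party package:

```python
import re
from itertools import tee

def pairwise(iterable):
    # the same recipe more_itertools uses: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

delimiter_re = r'[\n\- ]'
search_regex = r'''^(?!{0})    # string start with no delimiter
                 | {0}+        # sequence of delimiters (at least one)
                 | (?<!{0})$   # string end with no delimiter
                 '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)

string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
tokens = [string[prev.end():curr.start()]
          for prev, curr in pairwise(search_pattern.finditer(string))]
print(tokens)   # no empty strings at either end
```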
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
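Here is a runnable version of that loop on the question's sample string, collecting the segments into a list in place of do_something_with (the variable names are mine):

```python
import re

token = re.compile(r'[^\n\- ]+')
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "

gaps = []
previous_end = 0
for match in token.finditer(string):
    gaps.append(string[previous_end:match.start()])
    previous_end = match.end()
gaps.append(string[previous_end:])
print(gaps)   # one delimiter run per gap, including the leading and trailing spaces
```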
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate.
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.
It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead use regex groups and a negated character class ([^...]) to achieve this:
# match any number of consecutive non-delim chars
string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d
I have an input string for e.g:
input_str = 'this is a test for [blah] and [blah/blahhhh]'
and I want to retain [blah] but want to remove [blah/blahhhh] from the above string.
I tried the following codes:
>>> re.sub(r'\[.*?\]', '', input_str)
'this is a test for and '
and
>>> re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '
what should be the right regex pattern to get the output as "this is a test for [blah] and"?
At first I didn't understand why your 2nd regex doesn't work, but I tested it and yes, you are correct: it doesn't work. You can use the same idea with a different approach.
Instead of using the wildcards you can use the \w like this:
\[\w+\/\w+\]
By the way, if you can have non-word characters separated by the /, then you can use this regex:
\[[^\]]*\/[^\]]*]
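Both patterns give the asked-for result on the question's input (a quick check):

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# \w-based version: bracketed groups of word chars around a slash
print(re.sub(r'\[\w+\/\w+\]', '', input_str))
# 'this is a test for [blah] and '

# more general version: anything except ']' on either side of the slash
print(re.sub(r'\[[^\]]*\/[^\]]*]', '', input_str))
# 'this is a test for [blah] and '
```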
The reason the second regex in the original post matches more than the OP wants is that . matches any character, including ]. So \[.*?\/ (or just \[.*?/, since the \ before the / is superfluous) will match more than it seems the OP wanted: [blah] and [blah/ in input_str.
The ? adds confusion. It will limit repetition of the .* part of .*\] sub-expression, but you have to understand what repetition you're limiting [1]. It's better to explicitly match any non-closing bracket instead of the . wildcard to begin with. So-called "greedy" matching of .* is often a stumbling block since it will match zero or more occurrences of any character until that wildcard match fails (usually much longer than people expect). In your case it greedily matches as much of the input as possible until the last occurrence of the next explicitly specified part of the regex (] or / in your regexes). Instead of using ? to try to counteract or limit greedy matching with lazy matching, it is often better to be explicit about what to not match in the greedy part.
As an illustration, see the following example of .* grabbing everything until the last occurrence of the character after .*:
echo '////k////,/k' | sed -r 's|/.*/|XXX|'
XXXk
echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
XXXk
And subtleties of greedy / lazy matching behavior can vary from one regex implementation to the next (pcre, python, grep/egrep). For portability and simplicity / clarity, be explicit when you can.
If you only want to look for strings with brackets that don't include a closing bracket character before the slash character, you could more explicitly look for "not-a-closing-bracket" instead of the wildcard match:
re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '
This uses a character class expression - [^]] - instead of the wildcard . to match any character that is explicitly not a closing bracket.
If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.
[1] http://www.regular-expressions.info/repeat.html
You can write a function that takes input_str as an argument and loops through the string; if it sees a '/' between '[' and ']', it jumps back to the position of the '[' and removes all characters up to and including the ']'.
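That approach, sketched as a small function (remove_slash_brackets is a hypothetical name, not from the question):

```python
def remove_slash_brackets(text):
    # Scan the string, copying characters; when a [...] group containing
    # a '/' is found, skip it entirely instead of copying it.
    out = []
    i = 0
    while i < len(text):
        if text[i] == '[':
            end = text.find(']', i)
            if end != -1 and '/' in text[i:end]:
                i = end + 1          # skip the whole bracketed group
                continue
        out.append(text[i])
        i += 1
    return ''.join(out)

input_str = 'this is a test for [blah] and [blah/blahhhh]'
print(remove_slash_brackets(input_str))  # 'this is a test for [blah] and '
```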
I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print a1.group(1)
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true: by default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character, including newlines.
So, here's how you could turn your current code into a working system:
import re

def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None  # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
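For example, pasting the question's sample data into a string (whitespace approximated) and searching with the same pattern:

```python
import re

text = """sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
 dfsflks %jdkeajd
sdjfls:
 adkfld %dk_.(%sfj)sdaj, %kjdflajfs
 afjdfj _ajhfkdjf
 zjddjh -15afjkkd
xyz"""

pattern = r"{}.*{}".format(re.escape("_abc:"), re.escape("xyz"))
result = re.search(pattern, text, re.DOTALL)
print(result.group())   # everything from "_abc:" through "xyz", newlines included
```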
If I have a large string with multiple lines and I want to match part of a line only to end of that line, what is the best way to do that?
So, for example I have something like this and I want it to stop matching when it reaches the new line character.
r"(?P<name>[A-Za-z\s.]+)"
I saw this in a previous answer:
$ - indicates matching to the end of the string, or end of a line if
multiline is enabled.
My question is then how do you "enable multiline" as the author of that answer states?
Simply use
r"(?P<name>[A-Za-z\t .]+)"
This will match ASCII letters, spaces, tabs or periods. It'll stop at the first character that's not included in the group - and newlines aren't (whereas they are included in \s, and because of that it's irrelevant whether multiline mode is turned on or off).
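A quick check on a hypothetical two-line sample:

```python
import re

text = "John Smith Jr.\nmore letters here"
m = re.match(r"(?P<name>[A-Za-z\t .]+)", text)
print(m.group("name"))   # 'John Smith Jr.' - the match stops at the newline
```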
You can enable multiline matching by passing re.MULTILINE as the second argument to re.compile(). However, there is a subtlety to watch out for: since the + quantifier is greedy, this regular expression will match as long a string as possible, so if the next line is also made up of letters and whitespace, the regex may match more than one line (\s matches the newline itself, and in multiline mode $ matches at the end of any line).
There are three solutions to this:
Change your regex so that, instead of matching any whitespace including newline (\s) your repeated character set does not match that newline.
Change the quantifier to +?, the non-greedy ("minimal") version of +, so that it will match as short a string as possible and therefore stop at the first newline.
Change your code to first split the text up into an individual string for each line (using text.split('\n')).
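For completeness, here is the re.MULTILINE route, pairing $ with a non-greedy quantifier as in option 2 (a sketch with a made-up two-line sample):

```python
import re

text = "First line\nSecond line"

# re.MULTILINE makes $ match at each end-of-line; combined with a lazy
# quantifier, the match stops at the first line break
m = re.search(r"^[A-Za-z ]+?$", text, re.MULTILINE)
print(m.group())   # 'First line'
```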
Look at the flags parameter at http://docs.python.org/library/re.html#module-contents
I have a python template engine that heavily uses regexp. It uses concatenation like:
re.compile( regexp1 + "|" + regexp2 + "*|" + regexp3 + "+" )
I can modify the individual substrings (regexp1, regexp2 etc).
Is there any small and light expression that matches nothing, which I can use inside a template where I don't want any matches? Unfortunately, sometimes '+' or '*' is appended to the regexp atom so I can't use an empty string - that will raise a "nothing to repeat" error.
This shouldn't match anything:
re.compile('$^')
So if you replace regexp1, regexp2 and regexp3 with '$^' it will be impossible to find a match - unless you are using multiline mode.
After some tests I found a better solution:
re.compile('a^')
It is impossible to match and will fail earlier than the previous solution. You can replace a with any other character, and it will always be impossible to match.
(?!) should always fail to match. It is the zero-width negative look-ahead. If what is in the parentheses matches then the whole match fails. Given that it has nothing in it, it will fail the match for anything (including nothing).
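A quick check that (?!) indeed never matches anything:

```python
import re

# (?!) is an empty negative lookahead: the empty pattern always matches,
# so the lookahead always fails, at every input position
pattern = re.compile(r'(?!)')
print(pattern.search('anything'))   # None
print(pattern.search(''))           # None
```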
To match an empty string - even in multiline mode - you can use \A\Z, so:
re.compile(r'\A\Z|\A\Z*|\A\Z+')
The difference is that \A and \Z are the start and end of the whole string, whilst ^ and $ can match the start/end of lines, so $^|$^*|$^+ could potentially match a string containing newlines (if the multiline flag is enabled).
And to fail to match anything (even an empty string), simply attempt to find content before the start of the string, e.g.:
re.compile(r'.\A|.\A*|.\A+')
Since no characters can come before \A (by definition), the .\A alternation will always fail to match (though note that the * variant can still match a single character, since \A* is allowed to match zero times).
Maybe '.{0}'?
You could use
\Z..
This is the absolute end of string (Python spells this anchor \Z), followed by two of anything.
If + or * is tacked on the end this still works, refusing to match anything, because the quantifier binds only to the final . and the impossible \Z. pair remains.
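Note that in Python's re the absolute end-of-string anchor is spelled \Z (uppercase); a quick check:

```python
import re

# Nothing can follow the end of the string, so \Z.. can never match
print(re.compile(r'\Z..').search('anything'))    # None
print(re.compile(r'\Z..+').search('anything'))   # None - the + binds to the last . only
```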
Or, use some list comprehension to remove the useless regexp entries and join to put them all together. Something like:
re.compile('|'.join([x for x in [regexp1, regexp2, ...] if x is not None]))
Be sure to add some comments next to that line of code though :-)