convert vim regex to python for re.sub - python

I have a working regex under vim: /^ \{-}\a.*$\n
I implement a global search and replace as :%s/^ \{-}\a.*$\n//
This works great -- removes all lines that start with any number of spaces (matched non-greedily), followed by a letter and anything else to the end of the line including the newline.
I cannot (to save my soul) figure out the analogous regex in Python. Here's what makes sense to me:
x = re.sub("^ *?\a.$\n","",y)
But this doesn't do anything.
Many thanks for your sagacious replies.

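In Python, \a means the bell character (0x07), not "any letter" as in vim, and $\n is redundant. You also need re.MULTILINE so that ^ matches at the start of every line rather than only at the start of the string:
x = re.sub(r"^ *[A-Za-z].*\n", "", y, flags=re.MULTILINE)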
Also, there's no reason to write ' *?' instead of ' *' here, as it's always going to be followed by a non-space if it's matching.
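As a minimal sketch of that substitution, assuming the goal is the same as the vim command (drop every line that starts with optional spaces and a letter): the sample text and the helper name drop_letter_lines are made up for illustration. re.MULTILINE is what lets ^ match at the start of each line.

```python
import re

# Matches one whole line, including its newline, that starts with optional
# spaces followed by an ASCII letter. re.MULTILINE makes ^ match per line.
LETTER_LINE = re.compile(r"^ *[A-Za-z].*\n", re.MULTILINE)

def drop_letter_lines(text):
    """Remove every line that begins with optional spaces and a letter."""
    return LETTER_LINE.sub("", text)

# Hypothetical sample: three lines should go, two should stay.
sample = "  alpha\n123\n beta\n  # comment\ngamma\n"
print(repr(drop_letter_lines(sample)))
```

Because the pattern requires a trailing \n, a final line without a newline is left in place, just as with the vim pattern.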

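If you want to match any amount of whitespace, you can also use the \s sequence. Be aware that \s also matches newlines, so \s* can swallow blank lines that come before a matching line.
Any letter will be matched by the [a-zA-Z] character class. You also don't need both the $ and the \n; the \n alone is enough to consume the end of the line.
I suggest the following (re.MULTILINE is needed so that ^ matches at every line start):
x = re.sub(r"^\s*[a-zA-Z].*(\r|\n)", "", y, flags=re.MULTILINE)
If you want at least one whitespace character, use \s+ instead of \s*.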

Related

regex in python, remove pattern '[.../...]' from the string in python

I have an input string for e.g:
input_str = 'this is a test for [blah] and [blah/blahhhh]'
and I want to retain [blah] but want to remove [blah/blahhhh] from the above string.
I tried the following codes:
>>>re.sub(r'\[.*?\]', '', input_str)
'this is a test for and '
and
>>>re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '
what should be the right regex pattern to get the output as "this is a test for [blah] and"?
At first I didn't see why your 2nd regex wouldn't work, but I tested it and you are correct: it doesn't. You can keep the same idea with a different approach.
Instead of using wildcards you can use \w, like this:
\[\w+\/\w+\]
By the way, if you can have non-word characters on either side of the /, then you can use this regex:
\[[^\]]*\/[^\]]*]
The reason the second regex in the original post matches more than the OP wants is that . matches any character, including ]. So \[.*?\/ (or just \[.*?/, since the \ before the / is superfluous) will match more than it seems the OP wanted: [blah] and [blah/ in input_str.
The ? adds confusion. It makes the preceding .* lazy, so it repeats as few times as possible, but you have to understand which repetition you're limiting [1]: the match still begins at the first [ in the input, and the lazy .*? keeps expanding, past the first ], until a / is found. It's better to explicitly match any non-closing bracket instead of the . wildcard to begin with. So-called "greedy" matching of .* is often a stumbling block, since it matches zero or more occurrences of any character and only gives characters back when the rest of the regex fails to match (so it usually matches much more than people expect). A greedy .* matches as much of the input as possible, up to the last occurrence of the next explicitly specified part of the regex (] or / in your regexes). Instead of using ? to try to counteract greedy matching with lazy matching, it is often better to be explicit about what not to match in the repeated part.
As an illustration, see the following example of .* grabbing everything until the last occurrence of the character after .*:
echo '////k////,/k' | sed -r 's|/.*/|XXX|'
XXXk
echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
XXXk
And subtleties of greedy / lazy matching behavior can vary from one regex implementation to the next (pcre, python, grep/egrep). For portability and simplicity / clarity, be explicit when you can.
If you only want to look for strings with brackets that don't include a closing bracket character before the slash character, you could more explicitly look for "not-a-closing-bracket" instead of the wildcard match:
re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '
This uses a character class expression - [^]] - instead of the wildcard . to match any character that is explicitly not a closing bracket.
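To make the difference concrete, here is a small sketch that runs the lazy pattern from the question and the negated-character-class pattern side by side on the question's input_str. The variable names lazy and explicit are only for illustration.

```python
import re

input_str = 'this is a test for [blah] and [blah/blahhhh]'

# Lazy wildcard: the match starts at the first '[' and .*? keeps growing,
# past the first ']', until it finally reaches a '/'.
lazy = re.search(r'\[.*?/.*?\]', input_str)
print(lazy.group(0))      # [blah] and [blah/blahhhh]

# Negated class: [^]]* can never cross a ']', so the first bracket pair
# (which has no slash) is skipped and only the second one matches.
explicit = re.search(r'\[[^]]*/[^]]*\]', input_str)
print(explicit.group(0))  # [blah/blahhhh]
```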
If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.
[1] http://www.regular-expressions.info/repeat.html
You can write a function that takes input_str as an argument and loops through the string; if it sees a '/' between '[' and ']', it jumps back to the position of the '[' and removes everything up to and including the ']'.
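A rough sketch of that loop-based idea, without regular expressions. The function name remove_slashed_brackets is hypothetical, and the sketch assumes brackets are never nested and that an unmatched '[' should be left untouched.

```python
def remove_slashed_brackets(text):
    """Drop every [...] group whose content contains a '/'.

    Assumes brackets are not nested; an unmatched '[' is kept as-is.
    """
    out = []
    pos = 0
    while pos < len(text):
        start = text.find("[", pos)
        if start == -1:
            break                      # no more opening brackets
        end = text.find("]", start)
        if end == -1:
            break                      # unmatched '[': keep the rest verbatim
        out.append(text[pos:start])    # text before the bracket group
        group = text[start:end + 1]
        if "/" not in group:
            out.append(group)          # keep groups without a slash
        pos = end + 1
    out.append(text[pos:])             # whatever is left after the last group
    return "".join(out)

input_str = 'this is a test for [blah] and [blah/blahhhh]'
print(repr(remove_slashed_brackets(input_str)))
```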

Regular expressions issue with python due to values with brackets

I have a very large string (300 MB+), and it has some garbage data in it that I need to clean up. I am using Python 2.7 32-bit.
I didn't want to use the string operation replace because the file the user uses is only going to grow over time, so I am trying to use re.sub to replace the value of [lineender] with a new line character like \n or os.linesep.
It seems simple enough to do, so my pattern is:
re.sub('\[lineender]\b', os.linesep, text_value)
This results in only one value being replaced in the whole string, which is wrong.
Sample Data:
s = """A|B|3[lineender]E|F|2M[lineender]"""
Any ideas on how I need to modify my regex to get this working?
I basically need to replace the bracket word with a new line character.
Note that \b in a non-raw string literal is a backspace. If you use a word boundary r'\b', it will require a word char (a letter, digit or an underscore) after ]. In your case, I'd remove \b altogether:
re.sub(r'\[lineender]', os.linesep, text_value)
If you want to make sure there is no word char after ], you may replace \b with \B, but please make sure you are using the r prefix to make your string literal raw.
See Python demo:
import re, os
text_value = """A|B|3[lineender]E|F|2M[lineender]"""
print('"{}"'.format(re.sub(r'\[lineender]', os.linesep, text_value)))
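You need to pass the pattern as a raw string (note that if you keep the \b, a [lineender] at the very end of the string or before a non-word character is still not replaced):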
re.sub(r'\[lineender\]\b', os.linesep, text_value)
alternatively, you'll have to use \\ (double backslashes):
re.sub('\\[lineender\\]\\b', os.linesep, text_value)
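The following sketch compares the variants discussed above on the sample data. It uses "\n" instead of os.linesep so the output is the same on every platform, which is an assumption made only for this illustration; the variable names are likewise illustrative.

```python
import re

text_value = "A|B|3[lineender]E|F|2M[lineender]"

# Raw string with \b: the first marker is followed by 'E' (a word char), so it
# matches; the last one is at the end of the string, so \b fails there.
with_boundary = re.sub(r'\[lineender]\b', "\n", text_value)

# Raw string without \b: every marker is replaced.
without_boundary = re.sub(r'\[lineender]', "\n", text_value)

# Non-raw '\b' is a backspace character (0x08), which never occurs in the
# data, so nothing is replaced at all.
with_backspace = re.sub('\\[lineender]\b', "\n", text_value)

print(repr(with_boundary))
print(repr(without_boundary))
print(repr(with_backspace))
```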

How do you replace regex end-of-line with string?

I have a for loop that produces a variable current_out_dir. Sometimes the variable will have a /. at the end of the line (that is, /.$), and I want to replace /.$ with /$. Currently I have .replace('/.','/'), but this would also affect hidden directories that start with a dot, e.g. /home/.log/file.txt
I've looked into re.sub() but I can't figure out how to apply it.
An unescaped dot matches any character except a newline, so you need to escape the dot to match a literal dot.
re.sub(r'(?<=/)\.$', r'', string)
/\.(?=$)
Try this; it should work for you. It uses a positive lookahead to assert the end of the string. Since the match includes the slash, use '/' as the replacement.
The question was about using regex, but I've come up with a more pythonic solution to the problem.
if os.path.split(current_out_dir)[1] == '.':
    current_out_dir = os.path.split(current_out_dir)[0]
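Here is a small sketch of the lookbehind variant applied to a few paths, including one with a hidden directory. The helper name strip_trailing_dot and the sample paths are invented for illustration.

```python
import re

def strip_trailing_dot(path):
    """Turn a trailing '/.' into '/', leaving hidden directories alone."""
    # (?<=/) requires a slash just before the dot without consuming it;
    # $ anchors the dot to the end of the string.
    return re.sub(r'(?<=/)\.$', '', path)

for p in ['/home/user/out/.', '/home/.log/file.txt', '/home/.log/.']:
    print(p, '->', strip_trailing_dot(p))
```

Note that the os.path.split approach returns the directory without its trailing slash, so the two solutions differ slightly in what they leave behind.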

Python regex positive look ahead

I have the following regex that is supposed to find sequences of words that end with a punctuation mark. The lookahead ensures that the match is followed by a space and a capital letter or digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")
Does Python 3.2 support variable-length lookahead (\s+)? I do not get any error. Furthermore, I cannot see any difference between the two patterns. Both seem to work the same regardless of the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the lookahead?
I'm not really sure what you are trying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
the lookahead requires another string to follow?
In any case:
\s requires one and only one space;
\s+ requires at least one space.
And yes, the lookahead accepts the "+" modifier even in Python 2.x (it is lookbehind that must be fixed-width in Python's re module).
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead might or might not be necessary, and allowing more than one space with "+" might or might not change the result.
The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter, while the second one expects at least one whitespace character and takes as many as possible.
The + is called a quantifier. It means one or more repetitions, as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
Further studying.
I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter
To answer this comment please consider :
What does \w.+? actually match?
A single word character [a-zA-Z0-9_] followed by at least one "any" character (except newline), but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead later matches. Therefore you consume all the blanks except one. This is why you see them in your output.
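To see where the two lookaheads can differ, here is a small sketch on a made-up sentence with three blanks after the first full stop. The sample text is an assumption; on inputs with single blanks only, both patterns give the same result.

```python
import re

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

# Hypothetical sample: three blanks after "one.", a single blank after the second "one."
text = "First one.   Second one. Third"

# pat1: after the first '.', \s[A-Z\d] fails (blank followed by blank), so the
# lazy .+? keeps growing until the next '.' that has exactly one blank after it.
print(pat1.findall(text))  # ['First one.   Second one.']

# pat2: \s+ absorbs all three blanks, so the first sentence ends at its own '.'.
print(pat2.findall(text))  # ['First one.', 'Second one.']
```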

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python not to treat the \'s as string escapes, so they reach the regex engine unchanged.
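A quick sketch of what the raw-string prefix changes, using \b and \a as examples. The sample sentence is made up for illustration.

```python
import re

# In a normal string literal, "\b" is a single backspace character (0x08)
# and "\a" is the bell character (0x07). In a raw string the backslash is
# kept, so the regex engine sees the two characters \ and b.
print(len("\b"), len(r"\b"))   # 1 2
print("\a" == chr(7))          # True

# With the raw string, \b is a word boundary; without it, the pattern looks
# for literal backspace characters and finds nothing.
print(re.findall(r"\bcat\b", "cat concat cat"))  # ['cat', 'cat']
print(re.findall("\bcat\b", "cat concat cat"))   # []
```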
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to match special characters literally in a regex, you need to escape them, such as \( and \\. (A / does not need escaping in Python, although \/ is harmless.)
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, capture characters until you see a close paren, then make sure a second close paren follows somewhere after it.
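As a sketch of how that nested-paren pattern could be used from Python on the sample URL: re.search is used because the pattern does not start at the beginning of the string, and the variable names are illustrative.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Outer '(' ... inner '(' ... captured text ... ')' ... ')'
nested = re.compile(r"\(.*?\((.*?)\).*?\)")

m = nested.search(url)
# Guard against URLs that do not contain the nested parentheses at all.
inner = m.group(1) if m else None
print(inner)  # ThinkCode
```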
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
re.sub(r'^.*\((\w+)\).*', r'\1', mystr)
