Why is re.sub not called re.replace - python

I've been writing Python for quite some time now, and so far it seems like the creators of the language put a lot of effort into the readability of the code. A good example of this is the re (regular expression) module.
Almost every method is clear in what it does:
re.search Scan through string looking for a match to the pattern
re.split Split the source string by the occurrences of the pattern
re.escape Escape all the characters in pattern except ASCII letters, numbers and '_'.
etc.
Until we hit the following two methods:
re.sub
re.subn
These methods do a regex-based replace, but the naming seems strange and out of place to me (especially when starting out, I constantly had to look the method names up). C#, for instance, calls the method Regex.Replace.
Is there a reason behind naming these methods sub and subn?
Why didn't the developers simply name it re.replace?

The traditional name of this regex command is substitute (or substitution). It comes from the original Unix ed, where you use s to perform it, and this has been retained in sed; perl also uses the s command syntax.
From Sed - An Introduction and Tutorial by Bruce Barnett
The essential command: s for substitution
Sed has several commands, but most people only learn the substitute
command: s. The substitute command changes all occurrences of the
regular expression into a new value. A simple example is changing
"day" in the "old" file to "night" in the "new" file:
sed s/day/night/ <old >new
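For comparison, here is a minimal sketch of the same day/night substitution with Python's re.sub and re.subn. The sample text and variable names are made up for illustration.

```python
import re

text = "day after day"  # hypothetical sample input

# re.sub returns the new string; by default every match is replaced.
all_replaced = re.sub(r"day", "night", text)

# re.subn returns a (new_string, number_of_substitutions) tuple.
with_count = re.subn(r"day", "night", text)

# count=1 limits the call to the first match only.
first_only = re.sub(r"day", "night", text, count=1)

print(all_replaced)
print(with_count)
print(first_only)
```

So the "n" in subn just means you also get the number of substitutions back.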

Related

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/
Many regular expression libraries only allow restricted expressions in lookbehind assertions, such as:
only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
only match strings with an upper bound length: (?<=\s{,4}) (up to four repetitions)
The main reason for these limitations is that those libraries can’t process regular expressions backwards at all, or can only do so for a limited subset.
Another reason could be to keep authors from building overly complex regular expressions that are expensive to process because of so-called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
Hey, if you're not using Python, you can trick the regex engine into behaving like a variable-length lookbehind by resetting the start of the match with \K.
This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k
Basically, when you have an expression that you match and you only want what comes after it, \K forces the reported match to start over from that point.
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
Matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ will cause the match to restart after the > that closes the opening div tag, so the result won't include anything before it. The (?=\<\/div) will make the engine get everything in front of the ending div tag.
What Amber said is true, but you can work around it with another approach: a non-capturing group
(?<=this\sis\san)(?:\s*)example
That makes the lookbehind fixed-length, so it should work (note that the matched text then includes the whitespace before "example").
You can use sub-expressions.
(this\sis\san\s*?)(example)
To retrieve group 2, "example", use $2 in most regex flavors, or \2 if you're using a replacement string (like for Python's re.sub).
Most regex engines don't support variable-length expressions for lookbehind assertions.
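To see this in Python specifically, here is a small sketch. It assumes the built-in re module (the third-party regex module behaves differently), and the sample text is made up. The workaround shown is the capture-group approach from the answers above.

```python
import re

text = "this is an   example"  # hypothetical input with several spaces

# The built-in re module rejects variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=this\sis\san\s*?)example")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Workaround: match the prefix normally and capture only the part you want.
match = re.search(r"this\sis\san\s*(example)", text)
captured = match.group(1) if match else None

print(lookbehind_supported, captured)
```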

Extracting all Latex commands from a Latex code File

I am trying to extract all the latex commands from a tex file. I have to use Python for this. I tried to extract the latex commands in a list using Re module.
The problem is that this list does not contain the latex commands whose name includes special characters (such as \alpha*, \a', \#, \$, +, :, \; etc). It only contains the latex commands that consist of letters.
I am presently using the re.match Python function:
# I already know the starting index of '\', which is at self.i.
# The example LaTeX code string could be:
# \documentclass[envcountsame,envcountchap]{svmono}
match_text = re.match(r"\w+", search_string[self.i + 1:])
I am able to extract 'documentclass'. But suppose there is another command like:
"\abstract*[alpha]{beta}"
"\${This is a latex document}"
"\:"
How do I extract only 'abstract*', '$', ':' from these strings?
I am new to Python and tried various approaches, but am not able to extract all these command names. If there is a general python Regex that can handle all these cases, it would be useful.
NOTE: A book called 'The Not So Short introduction to LaTeX' defines that the format of LaTeX commands can be of three types -
FORMATS:
They start with a backslash \ and then have a name consisting of
letters only. Command names are terminated by a space, a number or
any other ‘non-letter.’
They consist of a backslash and exactly one non-letter.
Many commands exist in a ‘starred variant’ where a star is appended to the command name.
Here's the exact translation of your format specification:
\\(?:[^a-zA-Z]|[a-zA-Z]+)\*?
Demo
non-letter: [^a-zA-Z]
or letters: [a-zA-Z]+
starred variant: \*?
If your format description is accurate, this should do it. Unfortunately I don't know LaTeX so I'm not sure it's 100% OK.
From the feedback in the comments, it turns out the star is applicable only to letter commands, and there can be some other terminating characters as well. The final regex is:
\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)
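Since the question was about Python, here is a minimal sketch of how the final regex could be used there. I replaced the non-capturing group with a capturing one so that findall returns the names without the backslash; the function name command_names is just something I made up for the example.

```python
import re

# Final pattern from above, with a capturing group around the command name.
LATEX_COMMAND = re.compile(r"\\([^a-zA-Z]|[a-zA-Z]+[*=']?)")

def command_names(latex_source):
    """Return the command names (without the backslash) found in the source."""
    return LATEX_COMMAND.findall(latex_source)

# The sample strings from the question.
samples = [
    r"\documentclass[envcountsame,envcountchap]{svmono}",
    r"\abstract*[alpha]{beta}",
    r"\${This is a latex document}",
    r"\:",
]
for sample in samples:
    print(command_names(sample))
```

Keep in mind this only looks at the text pattern; it knows nothing about redefined character classes or active characters.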
LaTeX is a TeX macro package, and as such, everything that applies to TeX also applies to LaTeX.
The question you ask is a difficult one, as TeX is not a regular language. If you only want to deal with commands, you can check for the regex \\([A-Za-z]+ *|.|\n) (see demo), keeping in mind that TeX has active characters, that is, characters whose mere presence acts like a command. If you want to deal with command parameters, you'll have to check the individual command definitions, because TeX is a Polish notation language (operators or commands are prefix, with a variable number of positional parameters). For parameter extraction, TeX uses brace matching, which is context-free and not regular, so you'll need a complete parser for that.
TeX allows you to redefine all character classes, so you can redefine the digits to act as letters and be usable in command names (so, for example, \a23 can be a valid command name). This happens inside package definitions, where @ is used as a letter to make commands that are inaccessible to users but available inside the package.
Eliminating LaTeX markup is difficult for this reason, and you can only achieve partial results. There are many different problems to be solved (what to do with \include directives, what to do with valid text in parameters such as those of \chapter or \footnote, whether you want the index included, etc.).
Also, you have to be careful: if you try to eliminate command parameters, you'll also be eliminating part of your text (for example the text in \footnote, \abstract, \title, \chapter{...}, etc.). I don't know the effect you actually want to get, so I cannot give you more info in this respect.

What does the "s!" operator in Perl do?

I have this Perl snippet from a script that I am translating into Python. I have no idea what the "s!" operator is doing; some sort of regex substitution. Unfortunately searching Google or Stackoverflow for operators like that doesn't yield many helpful results.
$var =~ s!<foo>.+?</foo>!!;
$var =~ s!;!/!g;
What is each line doing? I'd like to know in case I run into this operator again.
And, what would equivalent statements in Python be?
s!foo!bar! is the same as the more common s/foo/bar/, except that foo and bar can contain unescaped slashes without causing problems. It replaces the first occurrence of the regex foo with bar. The version with g replaces all occurrences.
It's doing exactly the same as $var =~ s///, i.e. performing a search and replace within the $var variable.
In Perl you can choose the delimiting character following the s. Why? So that, for example, if you're matching '/', you can specify another delimiting character ('!' in this case) and not have to escape the character you're matching. Otherwise you'd end up with (say)
s/;/\//g;
which is a little more confusing.
Perlre has more info on this.
Perl lets you choose the delimiter for many of its constructs. This makes it easier to see what is going on in expressions like
$str =~ s{/foo/bar/baz/}{/quux/};
As you can see though, not all delimiters have the same effects. Bracketing characters (<>, [], {}, and ()) use different characters for the beginning and ending. And ?, when used as a delimiter to a regex, causes the regexes to match only once between calls to the reset() operator.
You may find it helpful to read perldoc perlop (in particular the sections on m/PATTERN/msixpogc, ?PATTERN?, and s/PATTERN/REPLACEMENT/msixpogce).
s! is syntactic sugar for the 'proper' s/// operator. Basically, you can substitute whatever delimiter you want instead of the '/'s.
As to what each line is doing, the first line matches the first occurrence of the regex <foo>.+?</foo> and replaces the whole lot with nothing. The second matches every occurrence of the regex ; and replaces it with /.
s/// is the substitute operator. It takes a regular expression and a substitution string.
s/regex/replace string/;
It supports most (all?) of the normal regular expression switches, which are used in the normal way (by appending them to the end of the operator).
s is the substitution operator. Usually it is in the form of s/foo/bar/, but you can replace the / separator characters with some other character like !. Using other separator characters may make working with things like paths a lot easier since you don't need to escape path separators.
See manual page for further info.
You can find similar functionality for Python in the re module.
s is the substitution operator. Normally this uses '/' for the delimiter:
s/foo/bar/
but this is not required: a number of other characters can be used as delimiters instead. In this case, '!' has been used as the delimiter, presumably to avoid the need to escape the '/' characters in the actual text to be substituted.
In your specific case, the first line removes text matching '<foo>.+?</foo>'; i.e. it removes the first 'foo' element, tags and content included.
The second line replaces all ';' characters with '/' characters, globally (all occurrences).
The python equivalent code uses the re module:
f = re.sub(searchregx, replacement_str, line)
And the python equivalent is to use the re module.
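To make that concrete, here is a rough translation of the two Perl lines from the question. The sample value of var is made up; note that Perl's s/// without g replaces only the first match, which corresponds to count=1 in re.sub.

```python
import re

var = "a;b<foo>x;y</foo>;c<foo>z</foo>"  # hypothetical sample input

# $var =~ s!<foo>.+?</foo>!!;   (no g flag: only the first match is removed)
step1 = re.sub(r"<foo>.+?</foo>", "", var, count=1)

# $var =~ s!;!/!g;              (g flag: every ';' becomes '/')
result = re.sub(r";", "/", step1)

print(step1)
print(result)
```

For the second line a plain step1.replace(";", "/") would do as well, since no regex features are involved.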

Parsing in Python: what's the most efficient way to suppress/normalize strings?

I'm parsing a source file, and I want to "suppress" strings. What I mean by this is transform every string like "bla bla bla +/*" to something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s", please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect the beginning of a string, and "manually" run until its end, with several sub-special cases on the way). If there's a Python library function I can use, or a nice regex that may make my code more efficient, that would be great.
A few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is the assumption of double-occurrence (e.g. "") or any set of two characters (e.g. '\"') to escape enough? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Option 2: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
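import re
# source_code holds the text of the file being sanitized.
# Group 1 captures comments, which are kept as-is; string literals become "string".
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
                   lambda match: match.group(1) or '"string"', source_code)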
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
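For the Python case, the tokenize suggestion could look roughly like the sketch below. The function name suppress_strings and the placeholder are my own choices, and it only covers plain string literals: f-strings are tokenized differently on newer Python versions (3.12+), so they would need extra handling.

```python
import io
import tokenize

def suppress_strings(source, placeholder='"string"'):
    """Replace every plain string literal in Python source with a placeholder."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            # Swap the literal's text; positions are kept so spacing survives.
            tok = tok._replace(string=placeholder)
        tokens.append(tok)
    return tokenize.untokenize(tokens)

print(suppress_strings('print("hello**")\n'))
print(suppress_strings("print(str(123) + 'h e l l o')\n"))
```

Because the tokenizer already knows where strings start and end, comments and escaped quotes need no special casing here.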

How can I translate the following filename to a regular expression in Python?

I am battling regular expressions now as I type.
I would like to determine a pattern for the following example file: b410cv11_test.ext. I want to be able to do a search for files that match the pattern of the example file aforementioned. Where do I start (so lost and confused) and what is the best way of arriving at a solution that best matches the file pattern? Thanks in advance.
Further clarification of question:
I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by .'ext'
Now that you have a human-readable description of your file name, it's quite straightforward to translate it into a regular expression (at least in this case ;)
must start with
The caret (^) anchors a regular expression to the beginning of what you want to match, so your re has to start with this symbol.
'b',
Any non-special character in your re will match literally, so you just use "b" for this part: ^b.
followed by [...] digits,
This depends a bit on which flavor of re you use:
The most general way of expressing this is to use brackets ([]). Those mean "match any one of the characters listed within". [ASDF], for example, would match either A or S or D or F; [0-9] would match any character between 0 and 9.
Your re library probably has a shortcut for "any digit". In sed and awk you could use [[:digit:]] (the doubled brackets are intentional); in Python and many other languages you can use \d.
So now your re reads ^b\d.
followed by three [...]
The most simple way to express this would be to just repeat the atom three times like this: \d\d\d.
Again, your language might provide a shortcut: braces ({}). Sometimes you have to escape them with a backslash (if you are using sed or awk, read about "extended regular expressions"). They also give you a way to say "at least x, but no more than y occurrences of the previous atom": {x,y}.
Now you have: ^b\d{3}
followed by 'cv',
Literal matching again, now we have ^b\d{3}cv
followed by two digits,
We already covered this: ^b\d{3}cv\d{2}.
then an underscore, followed by 'release', followed by .'ext'
Again, this should all match literally, but the dot (.) is a special character. This means you have to escape it with a backslash: ^b\d{3}cv\d{2}_release\.ext
Leaving out the backslash would mean that a filename like "b410cv11_release_ext" would also match, which may or may not be a problem for you.
Finally, if you want to guarantee that nothing else follows ".ext", anchor the re to the end of the thing to match with the dollar sign ($).
Thus the complete regular expression for your specific problem would be:
^b\d{3}cv\d{2}_release\.ext$
Easy.
Whatever language or library you use, there has to be a reference somewhere in the documentation that will show you what the exact syntax in your case should be. Once you have learned to break down the problem into a suitable description, understanding the more advanced constructs will come to you step by step.
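In Python, the finished pattern could be used roughly like this. The candidate file names and the helper matching_files are made up for the example; in practice the names might come from something like os.listdir.

```python
import re

# The complete pattern built up step by step above.
FILENAME = re.compile(r"^b\d{3}cv\d{2}_release\.ext$")

def matching_files(names):
    """Keep only the names that fit the pattern."""
    return [name for name in names if FILENAME.match(name)]

candidates = [
    "b410cv11_release.ext",
    "b410cv11_release_ext",      # dot replaced by underscore
    "xb410cv11_release.ext",     # does not start with 'b'
    "b41cv11_release.ext",       # only two digits after 'b'
    "b410cv11_release.ext.bak",  # extra text after '.ext'
]
print(matching_files(candidates))
```

A raw string (r"...") keeps Python from interpreting the backslashes before the regex engine sees them.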
To avoid confusion, read the following, in order.
First, you have the glob module, which handles file name wildcard patterns just like the Windows and Unix shells.
Second, you have the fnmatch module, which just does pattern matching using the Unix shell rules.
Third, you have the re module, which is the complete set of regular expressions.
Then ask another, more specific question.
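As a rough illustration of the shell-style option, here is a sketch with fnmatch. The wildcard pattern and file names are my own guess at what the asker wants; glob.glob accepts the same kind of pattern when you want to search a directory directly.

```python
import fnmatch

# Shell-style pattern: 'b', three digits, 'cv', two digits, '_', anything, '.ext'.
PATTERN = "b[0-9][0-9][0-9]cv[0-9][0-9]_*.ext"

names = [
    "b410cv11_test.ext",
    "b123cv45_release.ext",
    "a410cv11_test.ext",  # wrong first letter
    "b41cv11_test.ext",   # only two digits after 'b'
]

# fnmatchcase compares case-sensitively on every platform.
matches = [name for name in names if fnmatch.fnmatchcase(name, PATTERN)]
print(matches)
```

Shell patterns have no "exactly three digits" repetition, which is why the digit class is simply written out three times.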
I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by .'ext'
^b\d{3}cv\d{2}_release\.ext$
Your question is a bit unclear. You say you want a regular expression, but could it be that you want a glob-style pattern you can use with commands like ls? Glob expressions and regular expressions are similar in concept but different in practice (regular expressions are considerably more powerful; glob-style patterns are easier for the most common cases when looking for files).
Also, what do you consider to be the pattern? Certainly, * (glob) or .* (regex) will match the pattern. Also, a *_test.ext (glob) or .*_test\.ext (regexp) pattern would match, as would many other variations.
Can you be more specific about the pattern? For example, you might describe it as "b, followed by digits, followed by cv, followed by digits ..."
Once you can precisely explain the pattern in your native language (and that must be your first step), it's usually a fairly straight-forward task to translate that into a glob or regular expression pattern.
If the letters are unimportant, you could try \w\d\d\d\w\w\d\d_test\.ext, which would match the letter/number pattern, or b\d\d\dcv\d\d_test\.ext, or some mix of the two.
When working with regexes I find the Mochikit regex example to be a great help.
/^b\d\d\dcv\d\d_test\.ext$/
Then use the python re (regex) module to do the match. This is of course assuming regex is really what you need and not glob as the others mentioned.
