I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Supported in Perl, PHP, Notepad++, R (with perl=TRUE), and Python via the PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
Since Ruby 2.0, \g<0> can be used to call the full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular (atomic grouping since Ruby 1.9.3)
JavaScript API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway matching on unbalanced input, only the innermost [)(] carries a *.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
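That counting approach can be sketched in a few lines of Python (the function name extract_balanced is illustrative, not from the linked answer):

```python
def extract_balanced(s):
    """Return the first substring enclosed by balanced parentheses, or None."""
    depth = 0
    start = None
    for i, ch in enumerate(s):
        if ch == '(':
            if depth == 0:
                start = i  # remember where the outermost group opens
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:
                # counter back to zero: this is the final closing parenthesis
                return s[start:i + 1]
    return None  # no balanced group found

print(extract_balanced("START(text here(nested)more(deep(er)))END"))
# -> (text here(nested)more(deep(er)))
```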
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
Consider an FSA with two states, S1 and S2, where S1 is both the starting and the final state: reading a 0 switches between S1 and S2, and reading a 1 keeps the current state. If we try it on the string 0110, the transitions go as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 -> S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means the count has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under the class of Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can be used to solve this problem, because we can 'push' each opening parenthesis onto the stack and 'pop' one when we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
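That push/pop idea translates directly into code. A minimal Python sketch (the function name is_balanced is illustrative, not from the answer):

```python
def is_balanced(s):
    """Check parentheses balance with an explicit stack, PDA-style."""
    stack = []
    for ch in s:
        if ch == '(':
            stack.append(ch)   # push each opening parenthesis
        elif ch == ')':
            if not stack:      # a closing paren with nothing to match
                return False
            stack.pop()        # pop the matching opener
    return not stack           # balanced iff the stack ends empty
```

For single-character delimiters a plain integer counter would suffice, but the explicit stack mirrors the PDA description and generalizes to multiple delimiter kinds.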
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
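For illustration, here is how that pattern behaves in Python: the greedy .* spans from just after the first ( to just before the last ), regardless of nesting.

```python
import re

s = "a(b(c)d)e(f)g"
m = re.search(r'(?<=\().*(?=\))', s)
print(m.group(0))  # -> b(c)d)e(f
```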
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation, dealing with nested patterns, and regular expressions with recursion support turned out to be the right tool for such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
Note that '(pip' is correctly handled as a string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
    source: source,
    open: '(',
    close: ')'
});
You can even do replacements:
balanced.replacements({
    source: source,
    open: '(',
    close: ')',
    replace: function (source, head, tail) {
        return head + source + tail;
    }
});
Here's a more complex and interactive example on JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
                                                 Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;
    int lastOpenDelimiter = -1;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        } else if (c == markEnd) {
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see @dehmann
If it's just first open to last close see @Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
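For that input, the two interpretations can be compared side by side in Python (variable names are illustrative):

```python
s = "abc ( 123 ( foobar ) def ) xyz ) ghij"

# Choice 1: first '(' to last ')' -- includes the stray trailing ')'
first_to_last = s[s.index('('):s.rindex(')') + 1]

# Choice 2: the first balanced group only, via a depth counter
depth, start, balanced = 0, None, None
for i, ch in enumerate(s):
    if ch == '(':
        if depth == 0:
            start = i
        depth += 1
    elif ch == ')' and depth:
        depth -= 1
        if depth == 0:
            balanced = s[start:i + 1]
            break

print(first_to_last)  # -> ( 123 ( foobar ) def ) xyz )
print(balanced)       # -> ( 123 ( foobar ) def )
```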
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print "\ninput", s
        try:
            lst, s = matchnested(s)
            print "output", lst
        except ValueError as e:
            print str(e)
    print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you the first occurrence
str.lastIndexOf(')'); - the last one
So you need the string between:
String searchedString = str.substring(str.indexOf('(') + 1, str.lastIndexOf(')'));
Because JS regex doesn't support recursive matching, I can't make balanced parentheses matching work.
So here is a simple JavaScript for-loop version that turns a "method(arg)" string into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
    let ops = []
    let method, arg
    let isMethod = true
    let open = []
    for (const char of str) {
        // skip whitespace
        if (char === ' ') continue
        // append method or arg string
        if (char !== '(' && char !== ')') {
            if (isMethod) {
                (method ? (method += char) : (method = char))
            } else {
                (arg ? (arg += char) : (arg = char))
            }
        }
        if (char === '(') {
            // nested parenthesis should be a part of arg
            if (!isMethod) arg += char
            isMethod = false
            open.push(char)
        } else if (char === ')') {
            open.pop()
            // check end of arg
            if (open.length < 1) {
                isMethod = true
                ops.push({ method, arg })
                method = arg = undefined
            } else {
                arg += char
            }
        }
    }
    return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
The result looks like:
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
I didn't use regex, since nested code is difficult to deal with that way. This snippet lets you grab sections of code with balanced brackets:
def extract_code(data):
    """ returns an array of code snippets from a string (data)"""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it might be useful to those coming here in search of nested-structure regexes:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
 * get param content of function string.
 * only params string should be provided without parentheses
 * works even if some/all params are not set
 * @return [param1, param2, param3]
 */
exports.getParamsSAFE = (str, nbParams = 3) => {
    const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
    const params = [];
    while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
        str = str.replace(nextParamReg, (full, p1) => {
            params.push(p1);
            return '';
        });
    }
    return params;
};
This might help to match parentheses in simple, non-nested cases:
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)
The problem is I cannot avoid working with extremely big files which contain no newlines in them:
<a>text1</a>...gigabytes of data here, all in one single line...[a text to extract b>
What should I do if I want to copy matches from this file (putting every match in a separate line, for convenience)? Say, <b>.*?</b>.
If I use
grep -Pzo '\[a .*? b>' path/to/input.txt > path/to/output.txt
it will just give an error: memory exhausted (this is a related question: grep-memory-exhausted).
Neither sed nor awk can cope with such a file. So, how should I extract matches from it?
#!/usr/bin/perl
use strict;
use warnings;

use constant BLOCK_SIZE => 64*1024;

my $buf = "";
my $searching = 1;
while (1) {
    my $rv = read(\*STDIN, $buf, BLOCK_SIZE, length($buf));
    die($!) if !defined($rv);
    last if !$rv;

    while (1) {
        if ($searching) {
            my $len = $buf =~ m{\[(?:a|\z)} ? $-[0] : length($buf);
            substr($buf, 0, $len, '');
            last if $buf !~ s{^\[a}{};
            $searching = 0;
        } else {
            my $len = $buf =~ m{b(?:>|\z)} ? $-[0] : length($buf);
            print substr($buf, 0, $len, '');
            last if $buf !~ s{^b>}{};
            print("\n");
            $searching = 1;
        }
    }
}
Lots of assumptions made:
Assumes the start tag is spelled exactly [a.
Assumes the end tag is spelled exactly b>.
Assumes each start tag has a corresponding end tag.
Assumes each end tag has a corresponding start tag.
Assumes [a won't be found between [a and b>.
Grep has different behavior starting with version 2.21:
When searching binary data, grep now may treat non-text bytes as line
terminators. This can boost performance significantly.
So what happens now is that with binary data, all non-text bytes
(including newlines) are treated as line terminators. If you want to change this
behavior, you can:
use --text. This will ensure that only newlines are line terminators
use --null-data. This will ensure that only null bytes are line terminators
--line-regexp option with null data
[Python people: My question is at the very end :-)]
I want to use UTF-8 within C string literals for readability and easy maintenance. However, this is not universally portable. My solution is to create a file foo.c.in which gets converted by a small Perl script to file foo.c so that it contains \xXX escape sequences instead of bytes larger than or equal to 0x80.
For simplicity, I assume that a C string starts and ends in the same line.
This is the Perl code I've created. In case a byte >= 0x80 is found, the original string is emitted as a comment also.
use strict;
use warnings;
binmode STDIN, ':raw';
binmode STDOUT, ':raw';
sub utf8_to_esc
{
    my $string = shift;
    my $oldstring = $string;
    my $count = 0;
    $string =~ s/([\x80-\xFF])/$count++; sprintf("\\x%02X", ord($1))/eg;
    $string = '"' . $string . '"';
    $string .= " /* " . $oldstring . " */" if $count;
    return $string;
}

while (<>)
{
    s/"((?:[^"\\]++|\\.)*+)"/utf8_to_esc($1)/eg;
    print;
}
For example, the input
"fööbär"
gets converted to
"f\xC3\xB6\xC3\xB6b\xC3\xA4r" /* fööbär */
Finally, my question: I'm not very good in Perl, and I wonder whether it is possible to rewrite the code in a more elegant (or more 'Perlish') way. I would also like if someone could point to similar code written in Python.
I think it's best if you don't use :raw. You are processing text, so you should properly decode and encode. That will be far less error prone, and it will allow your parser to use predefined character classes if you so desire.
You parse as if you expect slashes in the literal, but then you completely ignore them when you escape. Because of that, you could end up with "...\\xC3\xA3...". Working with decoded text will also help here.
So forget "perlish"; let's actually fix the bugs.
use open ':std', ':locale';

sub convert_char {
    my ($s) = @_;
    utf8::encode($s);
    $s = uc unpack 'H*', $s;
    $s =~ s/\G(..)/\\x$1/sg;
    return $s;
}

sub convert_literal {
    my $orig = my $s = substr($_[0], 1, -1);
    my $safe          = '\x20-\x7E';          # ASCII printables and space
    my $safe_no_slash = '\x20-\x5B\x5D-\x7E'; # ASCII printables and space, no \
    my $changed = $s =~ s{
        (?: \\? ( [^$safe] )
        |   ( (?: [$safe_no_slash] | \\[$safe] )+ )
        )
    }{
        defined($1) ? convert_char($1) : $2
    }egx;
    # XXX Assumes $orig doesn't contain "*/"
    return qq{"$s"} . ( $changed ? " /* $orig */" : '' );
}

while (<>) {
    s/(" (?:[^"\\]++|\\.)*+ ")/ convert_literal($1) /segx;
    print;
}
Re: a more Perlish way.
You can use arbitrary delimiters for quote operators, so you can use string interpolation instead of explicit concatenation, which can look nicer. Also, counting the number of substitutions is unnecessary: substitution in scalar context evaluates to the number of matches.
I would have written your (misnamed!) function as
use strict; use warnings;
use Carp;

sub escape_high_bytes {
    my ($orig) = @_;

    # Complain if the input is not a string of bytes.
    utf8::downgrade($orig, 1)
        or carp "Input must be binary data";

    if ((my $changed = $orig) =~ s/([\P{ASCII}\P{Print}])/sprintf '\\x%02X', ord $1/eg) {
        # TODO make sure $orig does not contain "*/"
        return qq("$changed" /* $orig */);
    } else {
        return qq("$orig");
    }
}
The (my $copy = $str) =~ s/foo/bar/ is the standard idiom to run a replace in a copy of a string. With 5.14, we could also use the /r modifier, but then we don't know whether the pattern matched, and we would have to resort to counting.
Please be aware that this function has nothing to do with Unicode or UTF-8. The utf8::downgrade($string, $fail_ok) makes sure that the string can be represented using single bytes. If this can't be done (and the second argument is true), then it returns a false value.
The regex operators \p{...} and the negation \P{...} match codepoints that have a certain Unicode property. E.g. \P{ASCII} matches all characters that are not in the range [\x00-\x7F], and \P{Print} matches all characters that are not visible, e.g. control codes like \x00 but not whitespace.
Your while (<>) loop is arguably buggy: this does not necessarily iterate over STDIN. Rather, it iterates over the contents of the files listed in @ARGV (the command line arguments), or defaults to STDIN if that array is empty. Note that the :raw layer will not be declared for the files from @ARGV. Possible solutions:
You can use the open pragma to declare default layers for all filehandles.
You can while (<STDIN>).
Do you know what is Perlish? Using modules. As it happens, String::Escape already implements much of the functionality you want.
Similar code written in Python
Python 2.7
import re
import sys

def utf8_to_esc(matched):
    s = matched.group(1)
    s2 = s.encode('string-escape')
    result = '"{}"'.format(s2)
    if s != s2:
        result += ' /* {} */'.format(s)
    return result

sys.stdout.writelines(re.sub(r'"([^"]+)"', utf8_to_esc, line) for line in sys.stdin)
Python 3.x
def utf8_to_esc(matched):
    ...
    s2 = s.encode('unicode-escape').decode('ascii')
    ...
I have a large word list file with one word per line. I would like to filter out the words with repeating alphabets.
INPUT:
abducts
abe
abeam
abel
abele
OUTPUT:
abducts
abe
abel
I'd like to do this using Regex (grep or perl or python). Is that possible?
It's much easier to write a regex that matches words that do have repeating letters, and then negate the match:
my @input = qw(abducts abe abeam abel abele);
my @output = grep { not /(\w).*\1/ } @input;
(This code assumes that @input contains one word per entry.) But this problem isn't necessarily best solved with a regex.
I've given the code in Perl, but it could easily be translated into any regex flavor that supports backreferences, including grep (which also has the -v switch to negate the match).
$ egrep -vi '(.).*\1' wordlist
It is possible to use regex:
import re

inp = [
    'abducts',
    'abe',
    'abeam',
    'abel',
    'abele',
]

# detect a word which contains a character at least twice
rgx = re.compile(r'.*(.).*\1.*')

def filter_words(inp):
    for word in inp:
        if rgx.match(word) is None:
            yield word

print list(filter_words(inp))
Simple Stuff
Despite the inaccurate protestation that this is impossible with a regex, it certainly is possible.
While @cjm justly states that it is a lot easier to negate a positive match than it is to express a negative one as a single pattern, the model for doing so is sufficiently well-known that it becomes a mere matter of plugging things into that model. Given that:
/X/
matches something, then the way to express the condition
! /X/
in a single, positively-matching pattern is to write it as
/\A (?: (?! X ) . ) * \z /sx
Therefore, given that the positive pattern is
/ (\pL) .* \1 /sxi
the corresponding negative needs must be
/\A (?: (?! (\pL) .* \1 ) . ) * \z /sxi
by way of simple substitution for X.
Real-World Concerns
That said, there are extenuating concerns that may sometimes require more work. For example, while \pL describes any code point having the GeneralCategory=Letter property, it does not consider what to do with words like red‐violet–colored, ’Tisn’t, or fiancée — the latter of which is different in otherwise-equivalent NFD vs NFC forms.
You therefore must first run it through full decomposition, so that a string like "r\x{E9}sume\x{301}" would correctly detect the duplicate “letter é’s” — that is, all canonically equivalent grapheme cluster units.
To account for such as these, you must at a bare minimum first run your string through an NFD decomposition, and then afterwards also use grapheme clusters via \X instead of arbitrary code points via ..
So for English, you would want something that followed along these lines for the positive match, with the corresponding negative match per the substitution given above:
NFD($string) =~ m{
(?<ELEMENT>
(?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X
)
\X *
\k<ELEMENT>
}xi
But even with that there still remain certain outstanding issues unresolved, such as for example whether \N{EN DASH} and \N{HYPHEN} should be considered equivalent elements or different ones.
That’s because properly written, hyphenating two elements like red‐violet and colored to form the single compound word red‐violet–colored, where at least one of the pair already contains a hyphen, requires that one employ an EN DASH as the separator instead of a mere HYPHEN.
Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade‐off. People using typewriter‐English don’t even do that, though, using that super‐massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.
It just depends whether your text came from some 19th‐century manual typewriter — or whether it represents English text properly rendered under modern typesetting rules. :)
Conscientious Case Insensitivity
You will note I am here considering letter that differ in case alone to be the same one. That’s because I use the /i regex switch, ᴀᴋᴀ the (?i) pattern modifier.
That’s rather like saying that they are the same as collation strength 1 — but not quite, because Perl uses only case folding (albeit full case folding not simple) for its case insensitive matches, not some higher collation strength than the tertiary level as might be preferred.
Full equivalence at the primary collation strength is a significantly stronger statement, but one that may well be needed to fully solve the problem in the general case. However, that requires a lot more work than the problem necessarily requires in many specific instances. In short, it is overkill for many specific cases that actually arise, no matter how much it might be needed for the hypothetical general case.
This is made even more difficult because, although you can for example do this:
my $collator = new Unicode::Collate::Locale::
    level         => 1,
    locale        => "de__phonebook",
    normalization => undef,
    ;
if ($collator->cmp("müß", "MUESS") == 0) { ... }
and expect to get the right answer — and you do, hurray! — this sort of robust string comparison is not easily extended to regex matches.
Yet. :)
Summary
The choice of whether to under‐engineer — or to over‐engineer — a solution will vary according to individual circumstances, which no one can decide for you.
I like CJM’s solution that negates a positive match, myself, although it’s somewhat cavalier about what it considers a duplicate letter. Notice:
while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
    print "The letter <$2> is duplicated in the substring <$1>.\n";
}
produces:
The letter <e> is duplicated in the substring <e__phone>.
The letter <_> is duplicated in the substring <__>.
The letter <o> is duplicated in the substring <onebo>.
The letter <o> is duplicated in the substring <oo>.
That shows why when you need to match a letter, you should always use \pL ᴀᴋᴀ \p{Letter} instead of \w, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}].
Of course, when you need to match an alphabetic, you need to use \p{alpha} ᴀᴋᴀ \p{Alphabetic}, which isn’t at all the same as a mere letter — contrary to popular misunderstanding. :)
If you're dealing with long strings that are likely to have duplicate letters, stopping ASAP may help.
INPUT: for (@input) {
    my %seen;
    while (/(.)/sg) {
        next INPUT if $seen{$1}++;
    }
    say;
}
I'd go with the simplest solution unless the performance is found to be really unacceptable.
my @output = grep !/(.).*?\1/s, @input;
I was very curious about the relative speed of the various Perl-based methods submitted by other authors for this question. So, I decided to benchmark them.
Where necessary, I slightly modified each method so that it would populate an @output array, to keep the input and output consistent. I verified that all the methods produce the same @output, although I have not documented that assertion here.
Here is the script to benchmark the various methods:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese :hireswallclock);

# get a convenient list of words (on Mac OS X 10.6.6, this contains 234,936 entries)
open (my $fh, '<', '/usr/share/dict/words') or die "can't open words file: $!\n";
my @input = <$fh>;
close $fh;

# remove line breaks
chomp @input;

# set up the tests
my %tests = (
    # Author: cjm
    RegExp => sub { my @output = grep { not /(\w).*\1/ } @input },

    # Author: daotoad
    SplitCount => sub {
        my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;
    },

    # Author: ikegami
    NextIfSeen => sub {
        my @output;
        INPUT: for (@input) {
            my %seen;
            while (/(.)/sg) {
                next INPUT if $seen{$1}++;
            }
            push @output, $_;
        }
    },

    # Author: ysth
    BitMask => sub {
        my @output;
        for my $word (@input) {
            my $mask1 = $word x ( length($word) - 1 );
            my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
            if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
                push @output, $word;
            }
        }
    },
);

# run each test 100 times
cmpthese(100, \%tests);
Here are the results for 100 iterations.
s/iter SplitCount BitMask NextIfSeen RegExp
SplitCount 2.85 -- -11% -58% -85%
BitMask 2.54 12% -- -53% -83%
NextIfSeen 1.20 138% 113% -- -64%
RegExp 0.427 567% 496% 180% --
As you can see, cjm's "RegExp" method is the fastest by far. It is 180% faster than the next fastest method, ikegami's "NextIfSeen" method. I suspect that the relative speed of the RegExp and NextIfSeen methods will converge as the average length of the input strings increases. But for "normal" length English words, the RegExp method is the fastest.
cjm gave the regex, but here's an interesting non-regex way:
my @words = qw/abducts abe abeam abel abele/;
for my $word (@words) {
    my $mask1 = $word x ( length($word) - 1 );
    my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
    if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        print "$word\n";
    }
}
In response to cjm's solution, I wondered about how it compared to some rather terse Perl:
my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;
Since I am not constrained in character count and formatting here, I'll be a bit clearer, even to the point of over-documenting:
my @output = grep {
    # Split $_ on the empty string to get the letters in $_.
    my @letters = split '';

    # Use a hash to remove duplicate letters.
    my %unique_letters;
    @unique_letters{@letters} = ();  # This is a hash slice assignment.
                                     # See perldoc perlvar for more info.

    # Is the number of unique letters equal to the number of letters?
    keys %unique_letters == @letters
} @input;
And, of course in production code, please do something like this:
my @output = grep ! has_repeated_letters($_), @input;

sub has_repeated_letters {
    my $word = shift;
    # blah blah blah
    # see example above for the code to use here, with a nip and a tuck.
}
In python with a regex:
python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt
In python without a regex:
python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt
I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including output in the timing:
Timings without the regex:
python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop
Timings with the regex:
python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if not re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop
Clearly the regex is a tiny bit slower than a simple frozenset creation and len comparison in python.
You can't do this with a theoretical regular expression; a regular expression describes a finite state machine, and this task would require a stack to store which letters have been seen.
I would suggest doing this with a foreach and manually check each word with code.
Something like:
List chars
foreach word in list
    foreach letter in word
        if chars.Contains(letter) then remove word from list
        else chars.Add(letter)
    chars.Clear()  // reset for the next word
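A runnable Python rendering of that pseudocode (names are mine; it builds a new list of kept words rather than removing from the list while iterating, which is safer):

```python
def filter_words(words):
    kept = []
    for word in words:
        seen = set()              # cleared for every word, like chars.Clear()
        for letter in word:
            if letter in seen:
                break             # repeated letter: drop this word
            seen.add(letter)
        else:
            kept.append(word)     # inner loop finished: no repeats found
    return kept

print(filter_words(["abc", "abca", "xyz"]))  # → ['abc', 'xyz']
```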
I'm coding an email application that produces messages for sending via SMTP. That means I need to change every lone \n and \r character into the canonical \r\n sequence we all know and love. Here's the code I've got now:
CRLF = '\r\n'
msg = re.sub(r'(?<!\r)\n', CRLF, msg)
msg = re.sub(r'\r(?!\n)', CRLF, msg)
The problem is it's not very fast. On large messages (around 80k) it takes up nearly 30% of the time to send a message!
Can you do better? I eagerly await your Python gymnastics.
This regex helped:
re.sub(r'\r\n|\r|\n', '\r\n', msg)
But this code ended up winning:
msg.replace('\r\n','\n').replace('\r','\n').replace('\n','\r\n')
The original regexes took .6s to convert /usr/share/dict/words from \n to \r\n, the new regex took .3s, and the replace()s took .08s.
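A quick sanity check, my own sketch, that the single-regex approach and the winning replace() chain really produce identical output on mixed line endings:

```python
import re

def fix_regex(msg):
    # Alternation order matters: \r\n must be tried before the lone \r.
    return re.sub(r'\r\n|\r|\n', '\r\n', msg)

def fix_replace(msg):
    # Normalize everything down to \n, then expand every \n to \r\n.
    return msg.replace('\r\n', '\n').replace('\r', '\n').replace('\n', '\r\n')

sample = 'a\nb\r\nc\rd\r\r\n\n'
assert fix_regex(sample) == fix_replace(sample) == 'a\r\nb\r\nc\r\nd\r\n\r\n\r\n'
```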
Maybe the fact that it inserts an extra character into the middle of the string is killing it.
When you substitute in the text "hello \r world", the engine has to grow the entire string by one character to produce "hello \r\n world".
I would suggest looping over the string and looking at the characters one by one. If a character is not \r or \n, just append it to the new string unchanged; if it is \r or \n, append the correct replacement.
Code in C# (converting to Python should be trivial):
string FixLineEndings(string input)
{
if (string.IsNullOrEmpty(input))
return string.Empty;
StringBuilder rv = new StringBuilder(input.Length);
for(int i = 0; i < input.Length; i++)
{
char c = input[i];
if (c != '\r' && c != '\n')
{
rv.Append(c);
}
else if (c == '\n')
{
rv.Append("\r\n");
}
else if (c == '\r')
{
if (i == input.Length - 1)
{
rv.Append("\r\n"); //a \r at the end of the string
}
else if (input[i + 1] != '\n')
{
rv.Append("\r\n");
}
}
}
return rv.ToString();
}
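Since the answer says the Python port is trivial, here is one possible sketch of it (the function name is mine):

```python
def fix_line_endings(s):
    # Single pass, mirroring the C# StringBuilder loop above.
    out = []
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c == '\n':
            out.append('\r\n')            # bare \n
        elif c == '\r':
            out.append('\r\n')            # bare or trailing \r
            if i + 1 < n and s[i + 1] == '\n':
                i += 1                    # consume the \n of an existing \r\n
        else:
            out.append(c)
        i += 1
    return ''.join(out)

print(repr(fix_line_endings('a\rb\nc\r\nd')))  # → 'a\r\nb\r\nc\r\nd'
```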
This was interesting enough to go write up a sample program to test. I used the regex given in the other answer and the code for using the regex was:
static readonly Regex _r1 = new Regex(@"(?
I tried with a bunch of test cases. The outputs are:
------------------------
Size: 1000 characters
All\r
String: 00:00:00.0038237
Regex : 00:00:00.0047669
All\r\n
String: 00:00:00.0001745
Regex : 00:00:00.0009238
All\n
String: 00:00:00.0024014
Regex : 00:00:00.0029281
No \r or \n
String: 00:00:00.0000904
Regex : 00:00:00.0000628
\r at every 100th position and \n at every 102nd position
String: 00:00:00.0002232
Regex : 00:00:00.0001937
------------------------
Size: 10000 characters
All\r
String: 00:00:00.0010271
Regex : 00:00:00.0096480
All\r\n
String: 00:00:00.0006441
Regex : 00:00:00.0038943
All\n
String: 00:00:00.0010618
Regex : 00:00:00.0136604
No \r or \n
String: 00:00:00.0006781
Regex : 00:00:00.0001943
\r at every 100th position and \n at every 102nd position
String: 00:00:00.0006537
Regex : 00:00:00.0005838
which shows the string-replacing function doing better when the number of \r and \n characters is high. For regular use, though, the regex approach is much faster (see the last test cases in each set: the ones without \r\n and with few \r and \n characters).
This was of course coded in C# and not Python, but I'm guessing the relative run times would be similar across languages.
Replace them on the fly as you're writing the string to wherever it's going. If you use a regex or anything else you'll be making two passes: one to replace the characters and then one to write it out.
Deriving a new Stream class and wrapping it around whatever you're writing to is pretty effective; that's the way we do it with System.Net.Mail, and it means I can use the same stream encoder for writing to both files and network streams. I'd have to see some of your code in order to give you a really good way to do this, though.
Also, keep in mind that the actual replacement won't really be any faster; the total execution time is reduced because you're only making one pass instead of two (assuming you actually are writing the output of the email somewhere).
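A rough Python sketch of that wrap-the-stream idea (the class and method names are my own invention, not a real library API). The only subtlety is a \r at the end of one write() chunk, whose fate depends on the first character of the next chunk:

```python
import io

class CRLFWriter:
    # Wraps a writable text stream and normalizes bare CR or LF to CRLF on the fly.
    def __init__(self, stream):
        self.stream = stream
        self._pending_cr = False  # the previous chunk ended in '\r'

    def write(self, data):
        out = []
        for c in data:
            if self._pending_cr:
                out.append('\r\n')
                self._pending_cr = False
                if c == '\n':
                    continue          # the \r\n pair is already emitted
            if c == '\r':
                self._pending_cr = True
            elif c == '\n':
                out.append('\r\n')
            else:
                out.append(c)
        self.stream.write(''.join(out))

    def flush(self):
        if self._pending_cr:          # a trailing \r with nothing after it
            self.stream.write('\r\n')
            self._pending_cr = False
        self.stream.flush()

buf = io.StringIO()
w = CRLFWriter(buf)
w.write('hello \r world\nbye\r')
w.write('\nend')
w.flush()
print(repr(buf.getvalue()))  # → 'hello \r\n world\r\nbye\r\nend'
```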
You could start by pre-compiling the regexes, e.g.
FIXCR = re.compile(r'\r(?!\n)')
FIXLN = re.compile(r'(?<!\r)\n')
Then use FIXCR.sub and FIXLN.sub. Next, you could try to combine the regexes into one, with a | thingy, which should also help.
Something like this? Compile your regex.
CRLF = '\r\n'
cr_or_lf_regex = re.compile(r'(?:(?<!\r)\n)|(?:\r(?!\n))')
Then, when you want to replace stuff use this:
cr_or_lf_regex.sub(CRLF, msg)
EDIT: Since the above is actually slower, let me take another stab at it.
last_chr = ''

def fix_crlf(input_chr):
    global last_chr
    if input_chr != '\r' and input_chr != '\n' and last_chr != '\r':
        result = input_chr
    else:
        if last_chr == '\r' and input_chr == '\n': result = '\r\n'
        elif last_chr != '\r' and input_chr == '\n': result = '\r\n'
        elif last_chr == '\r' and input_chr != '\n': result = '\r\n%s' % input_chr
        else: result = ''
    last_chr = input_chr
    return result

fixed_msg = ''.join([fix_crlf(c) for c in msg])