regex: how to get repeating blocks as groups()? [duplicate] - python

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to know how to capture all the groups that match the pattern, not only the last one.

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
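As a rough illustration of the two points above, here is a minimal Python sketch (Python rather than Swift, and the variable names are only for this example). It shows the repeated group keeping only its last value, and then the "find all matches" approach with the anchors and the quantified wrapper removed.

```python
import re

text = "HELLO,THERE,WORLD"

# A repeated capture group keeps only its last iteration.
repeated = re.match(r"^(?:([A-Z]+),?)+$", text)
print(repeated.groups())   # ('WORLD',)

# Without the anchors and the quantified wrapper, ask for every match instead.
words = re.findall(r"[A-Z]+", text)
print(words)               # ['HELLO', 'THERE', 'WORLD']
```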

The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101
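The same distinction can be checked in Python. This is only a small sketch on a single word, not the PCRE demo referenced above:

```python
import re

# Repeating a captured group: group 1 is overwritten on every iteration.
last_only = re.match(r"(\w)+", "HELLO").group(1)

# Capturing a repeated group: the repetition happens inside group 1.
whole = re.match(r"((?:\w)+)", "HELLO").group(1)

print(last_only)  # O
print(whole)      # HELLO
```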

I think you need something like this....
import re
b = "HELLO,THERE,WORLD"
re.findall(r'[\w]+', b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']

After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that will match up to n words, as long as your n is predetermined. For instance, if I want to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the following strings, with one, two or three capturing groups filled.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
  regexp_str = "^(#{group_regexp})"
  (count - 1).times do
    regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
  end
  regexp_str += "$"
  return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)
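For readers who prefer Python, the same generator could look roughly like the sketch below. The helper name make_pattern and its parameters are hypothetical, chosen to mirror the Ruby version. Optional groups that did not take part in the match come back as None.

```python
import re

def make_pattern(group_regexp, count=3, delimiter=","):
    """Build ^(g)(?:,(g))?...$ with `count` capture groups (hypothetical helper)."""
    optional = "(?:%s(%s))?" % (re.escape(delimiter), group_regexp)
    return "^(%s)%s$" % (group_regexp, optional * (count - 1))

pattern = re.compile(make_pattern("[A-Z]+"))
for line in ("HELLO,LITTLE,WORLD", "HELLO,WORLD", "HELLO"):
    # Unmatched optional groups are None, so filter them out before printing.
    print([g for g in pattern.match(line).groups() if g is not None])
```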

Just to provide additional example of paragraph 2 in the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD

The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, which can be done in any regex implementation (language). So come up with the regex pattern for a single item.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern need be more restrictive then adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my @matches = $string =~ /([^,]+)/g;
say "@matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]), with the pattern \w+. Or, as in the question, get only the substrings of upper-case ASCII letters, [A-Z]+.

I know that my answer came late but it happens to me today and I solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
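So the first group (([A-Z]+),)+ will match all the repeated items except the final one, and ([A-Z]+) will match the final one. This works no matter how many repeated items the string contains. Note, however, that the repeated group still keeps only its last iteration, so the pattern validates the whole list rather than capturing every item.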
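A quick check in Python (a sketch, not part of the original answer) shows what the groups of this pattern actually hold for the sample string:

```python
import re

m = re.match(r"^(([A-Z]+),)+([A-Z]+)$", "HELLO,THERE,WORLD")

# Groups 1 and 2 belong to the repeated part and keep only its last iteration.
print(m.groups())  # ('THERE,', 'THERE', 'WORLD')
```

The whole list is matched, but HELLO does not appear in any group.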

You actually have one capture group that will match multiple times. Not multiple capture groups.
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
console.log(match[1]); // do whatever you want with each match
match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.

Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
let regex = /^((,?([A-Z]+))+)$/gm
let m
while ((m = regex.exec(str)) !== null) {
matches.unshift(m[3])
return str.replace(m[2], '')
}
return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.

Design a regex that matches each particular element of the list rather than the list as a whole. Apply it with /g
Iterate through the matches, cleaning them of any garbage such as list separators that got mixed in. You may require another regex, or you can get by with a simple replace substring method.
The sample code is in JS, sorry :) The idea must be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);

Put the repeated A-Z character class inside the group and let re.findall return every match.
import re
data="HELLO,THERE,WORLD"
pattern=r"([a-zA-Z]+)"
matches=re.findall(pattern,data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

Related

regex substitution [duplicate]

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
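A small Python sketch of this pattern applied to the three sentences from the question (the constant name DOUBLED is just for this example):

```python
import re

DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

samples = [
    "Paris in the the spring.",
    "Not that that is related.",
    "Why are you laughing? Are my my regular expressions THAT bad??",
]
for s in samples:
    m = DOUBLED.search(s)
    # group(0) is the doubled text, group(1) the word itself.
    print(m.group(0) if m else None)
```

The trailing \b is what stops a phrase such as "the theme" from being reported as a duplicate.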
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : A word boundary at the start of the word
(\w+) : One or more word characters, captured as group 1
(\s+\1\b)+ : One or more whitespace characters followed by a repeat of the captured word, ending at a word boundary. The + on the group lets it match more than one repetition.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
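The same idea in Python, as a sketch that assumes a single re.sub call is an acceptable substitute for the find-and-replace loop above. The function name collapse_repeats is hypothetical.

```python
import re

def collapse_repeats(text):
    # (\s+\1\b)+ consumes one or more repeats; the replacement keeps only group 1.
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)

print(collapse_repeats("Goodbye goodbye GooDbYe"))   # Goodbye
print(collapse_repeats("Paris in the the spring."))  # Paris in the spring.
```

Replacing through the pattern itself avoids feeding the matched text back in as a new regular expression, which is what replaceAll does with its first argument.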
Try this with below RE
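\b word boundary at the start of the word
\w+ one or more word characters
\W+ one or more non-word characters separating the words
\1 the same word that was already matched
\b word boundary at the end of the word
()* repeats the group zero or more times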
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source
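To see which words this lookahead pattern picks up, here is a short Python sketch on a made-up sentence. It matches every word that occurs again later in the string, so only the last occurrence of each word is left unmatched.

```python
import re

LATER_DUPLICATE = re.compile(r"\b(\w+)\b(?=.*?\b\1\b)", re.IGNORECASE)

text = "one two one three two"
# The first "one" and the first "two" both reappear later, so they match.
print(LATER_DUPLICATE.findall(text))  # ['one', 'two']
```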
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.
\1{2,} then requires at least two more copies of that phrase. So if there are 3 or more identical phrases in a row, it matches.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
/.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
/.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
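A Python sketch of the whitespace-boundary version, checked against the examples above. The names WS_BOUNDED and repeated_words are hypothetical. Because \s+ also matches a newline, a word repeated across a line break is found as well.

```python
import re

WS_BOUNDED = re.compile(r"(?<!\S)(\w+)(?:\s+\1)+(?!\S)")

def repeated_words(text):
    """Return each word that is immediately repeated, once per run."""
    return [m.group(1) for m in WS_BOUNDED.finditer(text)]

print(repeated_words("Paris in the the spring."))   # ['the']
print(repeated_words("This is $word word"))         # []
print(repeated_words("London in the\nthe winter"))  # ['the']
```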
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

How to regex for a numerical suffix?

I have the following regex (example is in Python):
pattern = re.compile(r'^(([a-zA-Z0-9]*[a-zA-Z]+)([\d]+)|([\d]+))$')
This correctly parses any string that has a numerical suffix and an optional prefix that is alphanumerics:
a123
a2a123
123
All will correctly see 123 as a suffix. It will correctly reject bad inputs:
abc
123abc
()123 # Or other non-alphanumerics
The regex itself is fairly unwieldy, though, and several of the capture groups are often empty as a result, meaning I have to go through the additional step of filtering them out. I am curious if there is a better way to be thinking about this regex than "a number OR a number preceded by an alphanumeric that ends in a character"?
You may use
^[A-Za-z0-9]*?([0-9]+)$
See the regex demo
Details
^ - start of string
[A-Za-z0-9]*? - any letters/digits, zero or more times, as few as possible (because this part is non-greedy, the next pattern, ([0-9]+), matches all the digits at the end of the string)
([0-9]+) - Group 1: one or more digits
$ - end of string.
In Python:
import re
s = "a2a123"  # sample input
m = re.search(r'^[A-Za-z0-9]*?([0-9]+)$', s) # Or, see below
# m = re.match(r'[A-Za-z0-9]*?([0-9]+)$', s) # re.match only searches at the start of the string
# m = re.fullmatch(r'[A-Za-z0-9]*?([0-9]+)', s) # Only in Python 3.x
if m:
    print(m.group(1))
If you use non-capturing groups and a correct management of repetitions, the problem eases itself.
pattern = re.compile(r'^(?:[a-zA-Z0-9]*[a-zA-Z]+)?([0-9]+)$')
There's only one capturing group (group 1) for the suffix, and the alphanumerics before it is not captured.
Alternatively, using named groups is another option, and it often makes long, structured regexes easier to maintain:
pattern = re.compile(r'^(?P<a>[a-zA-Z0-9]*[a-zA-Z]+)?(?P<suffix>[0-9]+)$')
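As a sanity check, the single-group pattern can be run against the inputs listed in the question. This is a minimal sketch; the helper name numeric_suffix is hypothetical.

```python
import re

SUFFIX = re.compile(r"^(?:[a-zA-Z0-9]*[a-zA-Z]+)?([0-9]+)$")

def numeric_suffix(s):
    """Return the trailing digits, or None when the input is rejected."""
    m = SUFFIX.match(s)
    return m.group(1) if m else None

for s in ("a123", "a2a123", "123", "abc", "123abc", "()123"):
    print(repr(s), "->", numeric_suffix(s))
```

The first three inputs yield 123 and the last three yield None, with no empty groups to filter out.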

Regex in python repetition Error

In my code I want the result [('22', '254', '15', '36')] but I get [('15', '36')]. It looks as if the group in my regex (?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3} is not captured 3 times.
import re
def fun(st):
    print(re.findall(r"(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
ip="22.254.15.36"
fun(ip)
Overview
As I mentioned in the comments below your question, most regex engines only capture the last match. So when you do (...){3}, only the last match is captured: E.g. (.){3} used against abc will only return c.
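The (.){3} example is easy to reproduce in Python, and writing one group per character shows the contrast:

```python
import re

# One group repeated three times: only the last iteration is kept.
repeated = re.findall(r"(.){3}", "abc")
print(repeated)   # ['c']

# Three separate groups: each one keeps its own character.
separate = re.findall(r"(.)(.)(.)", "abc")
print(separate)   # [('a', 'b', 'c')]
```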
Also, note that changing your regex to (2[0-4]\d|25[0-5]|[01]?\d{1,2}) performs much better and catches full numbers (currently you'll grab 25 instead of 255 on the last octet for example - unless you anchor it to the end).
To give you a fully functional regex for capturing each octet of the IP:
(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})\.(2[0-4]\d|25[0-5]|[01]?\d{1,2})
Personally, however, I'd separate the logic from the validation. The code below first validates the format of the string and then checks whether or not the logic (no octets greater than 255) passes while splitting the string on ".".
Code
See code in use here
import re
ip='22.254.15.36'
if re.match(r"(?:\d{1,3}\.){3}\d{1,3}$", ip):
    print([octet for octet in ip.split('.') if int(octet) < 256])
Result: ['22', '254', '15', '36']
If you're using this method to extract IPs from an arbitrary string, you can replace re.match() with re.search() or re.findall(). In that case you may want to remove $ and add some logic to ensure you're not matching special cases like 11.11.11.11.11: (?<!\d\.)\b(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)
You only have two capturing groups in your regex:
(?: # non-capturing group
( # group 1
[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)\.
){3}
( # group 2
[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?
)
That the first group can be repeated 3 times doesn't make it capture 3 times. The regex engine will only ever return 2 groups, and the last match in a given group will fill that group.
If you want to capture each of the parts of an IP address into separate groups, you'll have to explicitly define groups for each:
pattern = (
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.'
r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)')
def fun(st, p=re.compile(pattern)):
    return p.findall(st)
You could avoid that much repetition with a little string and list manipulation:
octet = r'([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)'
pattern = r'\.'.join([octet] * 4)
Next, the pattern will just as happily match the 25 portion of 255. Better to put matching of the 200-255 range at the start over matching smaller numbers:
octet = r'(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})'
pattern = r'\.'.join([octet] * 4)
This still allows leading 0 digits, by the way, but is
If all you are doing is passing in single IP addresses, then re.findall() is overkill; just use p.match() (matching only at the string start) or p.search(), and return the .groups() result if there is a match:
def fun(st, p=re.compile(pattern + '$')):
    match = p.match(st)
    return match and match.groups()
Note that no validation is done on the surrounding data, so if you are trying to extract IP addresses from a larger body of text you can't use re.match(), and can't add the $ anchor and the match could be from a larger number of octets (e.g. 22.22.22.22.22.22). You'd have to add some look-around operators for that:
# only match an IP address if there is no indication that it is part of a larger
# set of octets; no leading or trailing dot or digits
pattern = r'(?<![\.\d])' + pattern + r'(?![\.\d])'
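Putting the pieces of this answer together for the single-address case, a self-contained sketch might look like this. It uses re.fullmatch instead of the '$' anchor, which is a substitution made for this example, and the names OCTET, IPV4 and octets are hypothetical.

```python
import re

# Octet pattern from the answer: the 200-255 range is tried before shorter numbers.
OCTET = r"(2(?:5[0-5]|[0-4]\d)|[01]?[0-9]{1,2})"
IPV4 = re.compile(r"\.".join([OCTET] * 4))

def octets(address):
    """Return the four octets as a tuple of strings, or None if invalid."""
    m = IPV4.fullmatch(address)
    return m.groups() if m else None

print(octets("22.254.15.36"))    # ('22', '254', '15', '36')
print(octets("256.1.1.1"))       # None
print(octets("22.22.22.22.22"))  # None
```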
I encountered a very similar issue.
I found two solutions, using the official documentation.
The answer of @ctwheels above did mention the cause of the problem, and I really appreciate it, but it did not provide a solution.
Even when trying the lookbehind and the lookahead, it did not work.
First solution:
re.finditer
re.finditer iterates over match objects !!
You can use each one's 'group' method !
>>> def fun(st):
...     pr=re.finditer("(?:([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}([0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st)
...     for p in pr:
...         print(p.group(),end="")
...
>>> fun(ip)
22.254.15.36
Or !!!
Another solution: you can still use findall, but you'll have to make every group a non-capturing group. The main problem is not findall itself but the groups it returns, and a repeated group only keeps its last match:
"re.findall:
...If one or more groups are present in the pattern, return a list of groups"
(Python 3.8 Manuals)
So:
>>> def fun(st):
...     print(re.findall("(?:(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)\.){3}(?:[0-1]?[0-9]{0,2}|2?[0-4]?[0-9]|25[0-5]?)",st))
...
>>> fun(ip)
['22.254.15.36']
Have fun !

several replacements in single regular expression

I'm trying to read a set of data from a file such that it can be cast to complex. The entries are of the form
line='0.2741564350068515+2.6100840481550604*^-10*I\n',
which is supposed to be rendered as
'(0.2741564350068515+2.6100840481550604e-10j)'.
Hence I need to insert the pair of parentheses and change the symbols for imaginary unit and exponential notation. My clumsy solution is to perform each substitution individually,
re.sub("\*\^","e",re.sub("[\.]{0,1}\*I","j)",re.sub("(^)","(",line))).strip(),
but this is not exactly readable, or sane. Is there a way to use a single regex to do this substitution?
It seems that you can do without a regex at all:
line='0.2741564350068515+2.6100840481550604*^-10*I\n'
print("({})".format(line.strip().replace("*^", "e").replace("*I", "j")))
# => (0.2741564350068515+2.6100840481550604e-10j)
See the IDEONE demo
A "funny" regex way showing how to use capturing groups and check what was captured in the replacement with a lambda:
import re
line='0.2741564350068515+2.6100840481550604*^-10*I\n'
print("({})".format(re.sub(r"(\*\^)|([.]?\*I)", lambda m: "e" if m.group(1) else "j", line.strip())))
# => (0.2741564350068515+2.6100840481550604e-10j)
If Group 1 ((\*\^)) was matched we replace with e, if Group 2 matched, replace with j.
Note that {0,1} limiting quantifier means the same as ? quantifier - 1 or 0 times.
The easiest way to do this with regex is to make a pattern that matches the whole number and captures all the important parts in capture groups:
(.*?)\*\^(.*?)\*I
This would capture 0.2741564350068515+2.6100840481550604 in group 1 and -10 in group 2, so substituting with (\1e\2j) will give you the expected result:
(0.2741564350068515+2.6100840481550604e-10j)
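In Python, this single substitution could be written roughly as follows. It is a sketch using the sample line from the question, and the final complex() call only confirms that the result parses.

```python
import re

line = '0.2741564350068515+2.6100840481550604*^-10*I\n'

# Group 1 is everything before "*^", group 2 is the exponent before "*I".
converted = re.sub(r"(.*?)\*\^(.*?)\*I", r"(\1e\2j)", line.strip())
print(converted)           # (0.2741564350068515+2.6100840481550604e-10j)
print(complex(converted))
```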

Is this possible using regular expression

I am using Python 2.7 and I am fairly familiar with regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with a semicolon. The problem is that data wrapped in double quotes should retain embedded commas. Here is an example:
Before:
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:
"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
Is there a single regex that can do this?
This is another way, which avoids testing the whole string up to the end with a lookahead for each occurrence. It's a kind of (more or less) \G feature emulation for the re module.
Instead of testing what comes after the comma, this pattern finds the item before the comma (and the comma, obviously) and is written in a way that makes each whole match contiguous with the previous one.
re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)
online demo
details:
(?: # ensures that results are contiguous
(?<=,) # preceded by a comma (so, the one of the last result)
| # OR
^ # at the start of the string
)
(?= # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
( # capture the precedent item in group 1
"(?:"")*(?:[^"]+(?:"")*)*" # an item between quotes
|
[^",]* # an item without quotes
)
) \1 # back-reference for the capture group 1
,
The advantage of this approach is that it reduces the number of steps needed to obtain a match and gives a near-constant number of steps whatever the item before (see the regex101 debugger). The reason is that all characters are matched/tested only once. So even though the pattern is longer, it is more efficient (and the gain grows with long lines in particular).
The atomic group trick is only here to reduce the number of steps before failing for the last item (that is not followed by a comma).
Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo
# Python 2.7
import re
text = '''
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()
print "Before: " + text
print "After: " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))
This produces the following output:
Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After: "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
You can tinker with this here if you need more customization.
You can use:
>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
RegEx Demo
This will match comma only if it is outside quote by matching even number of quotes after ,.
This regex seems to do the job
,(?=(?:[^"]*"[^"]*")*[^"]*\Z)
Adapted from:
How to match something with regex that is not between two special characters?
And tested with http://pythex.org/
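Applied with re.sub to the sample row from the question, the lookahead pattern behaves as follows. This is a minimal sketch written with Python 3 print syntax, and the name OUTSIDE_QUOTES is hypothetical.

```python
import re

# A comma qualifies only if an even number of quotes remains after it.
OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*\Z)')

row = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
converted = OUTSIDE_QUOTES.sub(";", row)
print(converted)
# "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
```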
You can split with regex and then join it :
>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'
