Is this possible using regular expression - python

I am using Python 2.7 and I am fairly familiar with using regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with a semicolon. The problem is that data wrapped in double quotes should retain embedded commas. Here is an example:
Before:
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:
"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
Is there a single regex that can do this?

Here is another way that avoids scanning the rest of the string with a lookahead at every comma. It is more or less an emulation of the \G anchor, which the re module lacks.
Instead of testing what comes after the comma, this pattern matches the item before the comma (and the comma itself), and it is written so that each match starts exactly where the previous one ended.
re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)
details:
(?: # ensures that results are contiguous
(?<=,) # preceded by a comma (so, the one of the last result)
| # OR
^ # at the start of the string
)
(?= # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
( # capture the precedent item in group 1
"(?:"")*(?:[^"]+(?:"")*)*" # an item between quotes
|
[^",]* # an item without quotes
)
) \1 # back-reference for the capture group 1
,
The advantage of this approach is that it reduces the number of steps needed to obtain a match, and the number of steps stays nearly constant whatever item precedes the comma (see the regex101 debugger). The reason is that every character is matched or tested only once. So even though the pattern is longer, it is more efficient, and the gain grows with long lines.
The atomic group trick is only here to reduce the number of steps before failing for the last item (that is not followed by a comma).
Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo
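As a quick check of the pattern above, here is a minimal Python 3 sketch (the question targets 2.7, where the same re.sub call applies). The helper name commas_to_semicolons is hypothetical. The last call records a limitation that seems to follow from how the pattern is built: since this only emulates \G, when the final field is quoted and contains two or more commas, the scan can resume after a comma inside that field and replace the next one.

```python
import re

# Pattern copied from the answer above: match one whole field plus its comma.
FIELD_AND_COMMA = re.compile(
    r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,'
)

def commas_to_semicolons(line):
    # Hypothetical helper: keep the field (group 1), swap the comma for ';'.
    return FIELD_AND_COMMA.sub(r'\1;', line)

sample = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
escaped = '"abcd""efgh""ijkl","123""456""789",foo'
print(commas_to_semicolons(sample))
print(commas_to_semicolons(escaped))

# Limitation: a final quoted field with two or more commas is not safe.
print(commas_to_semicolons('x,"a,b,c"'))
```

If the last field can be a quoted value with several commas, one of the lookahead-based patterns from the other answers may be the safer choice.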

# Python 2.7
import re
text = '''
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()
print "Before: " + text
print "After: " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))
This produces the following output:
Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After: "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
You can tinker with this here if you need more customization.
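For readers on Python 3, the same findall-and-join idea can be sketched as below; the function name join_fields is hypothetical. One caveat worth knowing: both alternatives require at least one character, so empty fields are silently dropped from the output.

```python
import re

def join_fields(line, sep=";"):
    # Hypothetical helper: collect quoted or unquoted fields, then re-join them.
    fields = re.findall(r'"[^"]+"|[^,]+', line)
    return sep.join(fields)

text = '"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
print(join_fields(text))

# Caveat: the empty field between the two commas disappears.
print(join_fields("a,,b"))
```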

You can use:
>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
This matches a comma only if it is outside quotes, which the lookahead checks by requiring an even number of quotes between the comma and the end of the string.
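A minimal Python 3 sketch of this even-number-of-quotes test, assuming the quotes in the input are balanced. The function name is hypothetical, and the lookahead is written after the comma with non-capturing groups, which selects the same commas.

```python
import re

# A comma qualifies when the rest of the string holds an even number of quotes.
OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def replace_outer_commas(line):
    return OUTSIDE_QUOTES.sub(";", line)

s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
print(replace_outer_commas(s))
```

Every comma triggers a scan to the end of the line, so this form is simple but does more work on long lines with many commas.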

This regex seems to do the job
,(?=(?:[^"]*"[^"]*")*[^"]*\Z)
Adapted from:
How to match something with regex that is not between two special characters?
And tested with http://pythex.org/

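You can split with a regex and then join the pieces. Note that this relies on Python 2.7 behavior: from Python 3.7 on, re.split also splits on empty matches, which this pattern allows.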
>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'

Related

regex: how to get repeating blocks as groups()? [duplicate]

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to know how to capture all the groups that match the pattern, not only the last one.
With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
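The question is about Swift, but the behavior is the same in most regex engines, so here is a small Python sketch of the three cases discussed: the repeated group that keeps only its last value, the find-all-matches approach, and the expanded pattern with one group per word.

```python
import re

s = "HELLO,THERE,WORLD"

# A repeated capture group keeps only its last iteration.
repeated = re.match(r'^(?:([A-Z]+),?)+$', s)
print(repeated.group(1))

# Option 1: find every match of the inner pattern instead.
words = re.findall(r'[A-Z]+', s)
print(words)

# Option 2: one capture group per expected word.
fixed = re.match(r'^([A-Z]+),([A-Z]+),([A-Z]+)$', s)
print(fixed.groups())
```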
The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101
I think you need something like this:
import re
b = "HELLO,THERE,WORLD"
re.findall(r'[\w]+', b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']
After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that matches up to n words, as long as n is fixed in advance. For instance, to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the following strings, filling one, two or three capturing groups.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
  regexp_str = "^(#{group_regexp})"
  (count - 1).times.each do
    regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
  end
  regexp_str += "$"
  return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)
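For comparison, the same generator can be sketched in Python. This is a loose translation of the Ruby helper, not part of the original answer; as there, the delimiter is inserted as-is, so it is assumed to contain no regex metacharacters.

```python
import re

def make_regexp(group_regexp, count=3, delimiter=","):
    # First group is mandatory; each further group is optional.
    parts = ["^(" + group_regexp + ")"]
    for _ in range(count - 1):
        parts.append("(?:" + delimiter + "(" + group_regexp + "))?")
    parts.append("$")
    return "".join(parts)

pattern = make_regexp("[A-Z]+")
print(pattern)

# Groups that did not take part in the match come back as None.
print(re.match(pattern, "HELLO,WORLD").groups())
```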
Just to provide additional example of paragraph 2 in the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
    println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD
The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, which can be done in any regex implementation (language). So come up with the regex pattern for this.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern need be more restrictive then adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my @matches = $string =~ /([^,]+)/g;
say "@matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]), with the pattern \w+. Or, as in the question, get only the substrings of upper-case ASCII letters, [A-Z]+.
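The punctuation example from this answer can be checked with a short Python sketch; the sample string is the one used in the text.

```python
import re

text = "hi,there-again!"
# The hyphen is placed last so it is a literal, not part of a range.
words = re.findall(r'[^,.!-]+', text)
print(words)
```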
I know that my answer comes late, but I ran into this today and solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
The first group (([A-Z]+),)+ matches all the repeated words except the final one, and ([A-Z]+) matches the final one. The pattern matches no matter how many words the string contains, but the repeated groups still hold only their last iteration, so only the last two words end up in capture groups.
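A quick Python check of what this pattern actually captures, as a sketch on the question's sample string:

```python
import re

m = re.match(r'^(([A-Z]+),)+([A-Z]+)$', "HELLO,THERE,WORLD")
# Group 1 and group 2 come from the last repetition only; HELLO is not kept.
print(m.groups())
```

The whole string is validated, but to collect every word a find-all call is still needed.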
You actually have one capture group that will match multiple times. Not multiple capture groups.
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
    console.log(match[1]); // do whatever you want with each match
    match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.
Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
    let regex = /^((,?([A-Z]+))+)$/gm
    let m
    while ((m = regex.exec(str)) !== null) {
        matches.unshift(m[3])
        return str.replace(m[2], '')
    }
    return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.
Design a regex that matches each particular element of the list rather than the list as a whole. Apply it with /g.
Iterate through the matches, cleaning them of any garbage such as list separators that got mixed in. You may need another regex, or you can get by with a simple substring replace.
The sample code is in JS, sorry :) The idea must be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);
Put the A-Z pattern in a single group and let re.findall collect every match:
import re
data = "HELLO,THERE,WORLD"
pattern = r"([a-zA-Z]+)"
matches = re.findall(pattern, data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

Regex to extract the string

I need help with regex to get the following out of the string
dal001.caxxxxx.test.com. ---> caxxxxx.test.com
caxxxx.test.com -----> caxxxx.test.com
So basically in the first example, I don't want dal001 or anything that starts with 3 letters and 3 digits and want the rest of the string if it starts with only ca.
In second example I want the whole string that starts only with ca.
So far I have tried (^[a-z]{3}[\d]+\.)?(ca.*) but it doesn't work when the string is
dal001.mycaxxxx.test.com.
Any help would be appreciated.
You can use
^(?:[a-z]{3}\d{3}\.)?(ca.*)
See the regex demo. To make it case insensitive, compile with re.I (re.search(rx, s, re.I), see below).
Details:
^ - start of string
(?:[a-z]{3}\d{3}\.)? - an optional sequence of 3 letters and then 3 digits and a .
(ca.*) - Group 1: ca and the rest of the string.
See the Python demo:
import re
rx = r"^(?:[a-z]{3}\d{3}\.)?(ca.*)"
strs = ["dal001.caxxxxx.test.com","caxxxx.test.com"]
for s in strs:
    m = re.search(rx, s)
    if m:
        print( m.group(1) )
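To see how this differs from the attempt in the question, the sketch below runs both patterns on the string that caused trouble. The helper name extract is hypothetical. The original pattern has its anchor inside the optional group, so re.search is free to find ca later in the string; the corrected pattern anchors the whole match.

```python
import re

ORIGINAL = re.compile(r"(^[a-z]{3}[\d]+\.)?(ca.*)")
ANCHORED = re.compile(r"^(?:[a-z]{3}\d{3}\.)?(ca.*)")

def extract(host):
    # Hypothetical helper: return the ca... part, or None when it is absent.
    m = ANCHORED.search(host)
    return m.group(1) if m else None

problem = "dal001.mycaxxxx.test.com"
print(ORIGINAL.search(problem).group(2))  # finds ca in the middle of mycaxxxx
print(extract(problem))
print(extract("dal001.caxxxxx.test.com"))
print(extract("caxxxx.test.com"))
```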
Use re.sub like so:
import re
strs = ['dal001.caxxxxx.test.com', 'caxxxx.test.com']
for s in strs:
    s = re.sub(r'^[A-Za-z]{3}\d{3}[.]', '', s)
    print(s)
# caxxxxx.test.com
# caxxxx.test.com
If you are using re:
import re
my_strings = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com']
my_regex = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
compiled_regex = re.compile(my_regex)
for a_string in my_strings:
    if compiled_regex.match(a_string):
        print(compiled_regex.sub(r'\1', a_string))
my_regex matches a string that starts (^ anchors to the start of the string) with [3 letters][3 digits][a .], but only optionally, and using a non-capturing group (the (?:) will not get a numbered reference to use in sub). In either case, it must then contain ca followed by anything, and this part is used as the replacement in the call to re.sub. re.compile is used to make it a bit faster, in case you have many strings to match.
Note on re.compile:
Some answers don't bother pre-compiling the regex before the loop. They have made a trade: removing a single line of code, at the cost of an implicit pattern lookup on every iteration. The re module caches recently compiled patterns, so the regex is not rebuilt from scratch each time, but every call to a module-level function still pays for the cache lookup. If you use a regex in a loop body, compiling it first is a sound habit: it can make a measurable difference in a tight loop, and it adds no cost even when the number of iterations is small.
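To measure the difference yourself, a sketch along these lines can be used. The function names and the input size are arbitrary choices for illustration; timings depend on the machine and the Python version, so none are quoted here.

```python
import re
import timeit

PATTERN = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
hosts = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com'] * 1000
compiled = re.compile(PATTERN)

def with_module_function():
    # re.sub looks the pattern up in the module's cache on every call.
    return [re.sub(PATTERN, r'\1', h) for h in hosts]

def with_compiled_pattern():
    # The compiled object is reused directly.
    return [compiled.sub(r'\1', h) for h in hosts]

for fn in (with_module_function, with_compiled_pattern):
    print(fn.__name__, timeit.timeit(fn, number=20))
```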

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
    print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for similar post, but couldnt find it :(
You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
               ^     ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.
re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more characters other than $", which can avoid pathological backtracking behaviour.
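Putting these points together for the two lines from the question, a minimal sketch might look like this (the lines are held in a list here rather than read from a file):

```python
import re

lines = ["aaa$bb$ccc$ddd$eee", "fff$ggg$hh$iii$jj"]
DOLLAR_FIELD = re.compile(r'\$[^$]+\$')

found = []
for line in lines:
    # findall returns every non-overlapping match; search stops at the first.
    found.extend(DOLLAR_FIELD.findall(line))
print(found)

first_only = DOLLAR_FIELD.search(lines[0]).group(0)
print(first_only)
```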
Your approach is nearly right, but re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter. Note that $ must be escaped as \$, and that re.findall returns strings rather than match objects:
for m in re.findall(r'\$.*?\$', line):
    print(m)

python. re.findall and re.sub with '^'

I try to change string like s='2.3^2+3^3-√0.04*2+√4',
where 2.3^2 has to change to pow(2.3,2), 3^3 - pow(3,3), √0.04 - sqrt(0.04) and
√4 - sqrt(4).
s='2.3^2+3^3-√0.04*2+√4'
patt1='[0-9]+\.[0-9]+\^[0-9]+|[0-9]+\^[0-9]'
patt2='√[0-9]+\.[0-9]+|√[0-9]+'
idx1=re.findall(patt1, s)
idx2=re.findall(patt2, s)
idx11=[]
idx22=[]
for i in range(len(idx1)):
    idx11.append('pow('+idx1[i][:idx1[i].find('^')]+','+idx1[i][idx1[i].find('^')+1:]+')')
for i in range(len(idx2)):
    idx22.append('sqrt('+idx2[i][idx2[i].find('√')+1:]+')')
for i in range(len(idx11)):
    s=re.sub(idx1[i], idx11[i], s)
for i in range(len(idx22)):
    s=re.sub(idx2[i], idx22[i], s)
print(s)
Temp results:
idx1=['2.3^2', '3^3']
idx2=['√0.04', '√4']
idx11=['pow(2.3,2)', 'pow(3,3)']
idx22=['sqrt(0.04)', 'sqrt(4)']
but string result:
2.3^2+3^3-sqrt(0.04)*2+sqrt(4)
Why calculating 'idx1' is right, but re.sub don't insert this value into string ?
(sorry for my english:)
Try this using only re.sub()
Input string:
s='2.3^2+3^3-√0.04*2+√4'
Replacing for pow()
s = re.sub("(\d+(?:\.\d+)?)\^(\d+)", "pow(\\1,\\2)", s)
Replacing for sqrt()
s = re.sub("√(\d+(?:\.\d+)?)", "sqrt(\\1)", s)
Output:
pow(2.3,2)+pow(3,3)-sqrt(0.04)*2+sqrt(4)
( ) creates a capturing group, and \\1 refers to the first captured group of the match (\\2 to the second).
I've only got python 2.7.5 but this works for me, using str.replace rather than re.sub. Once you've gone to the effort of finding the matches and constructing their replacements, this is a simple find and replace job:
for i in range(len(idx11)):
    s = s.replace(idx1[i], idx11[i])
for i in range(len(idx22)):
    s = s.replace(idx2[i], idx22[i])
edit
I think you're going about this in quite a long-winded way. You can use re.sub in one go to make these changes:
s = re.sub(r'(\d+(\.\d+)?)\^(\d+)', r'pow(\1,\3)', s)
This replaces 2.3^2+3^3 with pow(2.3,2)+pow(3,3), and:
s = re.sub(r'√(\d+(\.\d+)?)', r'sqrt(\1)', s)
This replaces √0.04*2+√4 with sqrt(0.04)*2+sqrt(4).
There's a few things going on here that are different. Firstly, \d, which matches a digit, the same as [0-9]. Secondly, the ( ) capture whatever is inside them. In the replacement, you can refer to these captured groups by the order in which they appear. In the pow example I'm using the first and third group that I have captured.
The prefix r before the replacement string means that the string is to be treated as "raw", so characters are interpreted literally. The groups are accessed by \1, \2 etc. but because the backslash \ is an escape character, I would have to escape it each time (\\1, \\2, etc.) without the r.
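Neither answer spells out why the original re.sub calls left 2.3^2 untouched. The likely reason is that idx1[i] is passed to re.sub as a pattern, where ^ is a start-of-string anchor and . matches any character, so '2.3^2' can never match in the middle of the string. The √ patterns contain no anchor, which is why they worked. A minimal sketch, using re.escape to turn the literal text into a safe pattern:

```python
import re

s = '2.3^2+3^3-√0.04*2+√4'

# Unescaped: '^' is an anchor, so this pattern finds nothing and s is unchanged.
unchanged = re.sub('2.3^2', 'pow(2.3,2)', s)

# Escaped: every character of the search text is matched literally.
replaced = re.sub(re.escape('2.3^2'), 'pow(2.3,2)', s)
print(unchanged)
print(replaced)
```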

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ anchors the match from the start to the end of the string
[...] matches one of the listed characters
...{3} repeats the preceding element exactly three times
(...)* repeats the group in parentheses zero or more times
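A short Python sketch that applies this pattern to the examples from the question; the helper name is_valid is hypothetical.

```python
import re

TRIPLES = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid(data):
    # True only when the whole string is a comma-separated list of triples.
    return TRIPLES.match(data) is not None

for candidate in ("abc,bca,cbb", "bcb", "abc,defx,df", "abc,"):
    print(candidate, is_valid(candidate))
```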
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,bca,cbb")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print match.group('value')
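So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# the trailing $ forces the match to reach the end, so "abc,bca,df" is rejected
match = re.match(r'^([abc]{3},)*[abc]{3}$', data_string)
if match:
    print "data string is correct"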
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, at the start of the string, and it does not care what comes after. So the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print ss,' ',bool(re.match(r'([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group has unwelcome side effects in a lot of Python regex methods. A classic example is re.findall: when the pattern contains a capturing group, re.findall, and every regex method that uses it behind the scenes, returns only the captured substrings rather than the whole matches.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "This pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
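The re.findall effect is easy to reproduce with the standard library alone. The sketch below runs the capturing and non-capturing versions of the pattern on one valid input:

```python
import re

data = "abc,bca,cbb"

# Capturing group: findall returns the group's last iteration, not the match.
with_capture = re.findall(r'^[abc]{3}(,[abc]{3})*$', data)
print(with_capture)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'^[abc]{3}(?:,[abc]{3})*$', data)
print(without_capture)
```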
