python. re.findall and re.sub with '^'

python. re.findall and re.sub with '^' - python

I try to change string like s='2.3^2+3^3-√0.04*2+√4',
where 2.3^2 has to change to pow(2.3,2), 3^3 - pow(3,3), √0.04 - sqrt(0.04) and
√4 - sqrt(4).
s='2.3^2+3^3-√0.04*2+√4'
patt1='[0-9]+\.[0-9]+\^[0-9]+|[0-9]+\^[0-9]'
patt2='√[0-9]+\.[0-9]+|√[0-9]+'
idx1=re.findall(patt1, s)
idx2=re.findall(patt2, s)
idx11=[]
idx22=[]
for i in range(len(idx1)):
idx11.append('pow('+idx1[i][:idx1[i].find('^')]+','+idx1[i][idx1[i].find('^')+1:]+')')
for i in range(len(idx2)):
idx22.append('sqrt('+idx2[i][idx2[i].find('√')+1:]+')')
for i in range(len(idx11)):
s=re.sub(idx1[i], idx11[i], s)
for i in range(len(idx22)):
s=re.sub(idx2[i], idx22[i], s)
print(s)
Temp results:
idx1=['2.3^2', '3^3']
idx2=['√0.04', '√4']
idx11=['pow(2.3,2)', 'pow(3,3)']
idx22=['sqrt(0.04)', 'sqrt(4)']
but string result:
2.3^2+3^3-sqrt(0.04)*2+sqrt(4)
Why calculating 'idx1' is right, but re.sub don't insert this value into string ?
(sorry for my english:)

Try this using only re.sub()
Input string:
s='2.3^2+3^3-√0.04*2+√4'
Replacing for pow()
s = re.sub("(\d+(?:\.\d+)?)\^(\d+)", "pow(\\1,\\2)", s)
Replacing for sqrt()
s = re.sub("√(\d+(?:\.\d+)?)", "sqrt(\\1)", s)
Output:
pow(2.3,2)+pow(3,3)-sqrt(0.04)*2+sqrt(4)
() means group capture and \\1 means first captured group from regex match. Using this link you can get the detail explanation for the regex.

I've only got python 2.7.5 but this works for me, using str.replace rather than re.sub. Once you've gone to the effort of finding the matches and constructing their replacements, this is a simple find and replace job:
for i in range(len(idx11)):
s = s.replace(idx1[i], idx11[i])
for i in range(len(idx22)):
s = s.replace(idx2[i], idx22[i])
edit
I think you're going about this in quite a long-winded way. You can use re.sub in one go to make these changes:
s = re.sub('(\d+(\.\d+)?)\^(\d+)', r'pow(\1,\3)', s)
Will substitute 2.3^2+3^3 for pow(2.3,2)+pow(3,3) and:
s = re.sub('√(\d+(\.\d+)?)', r'sqrt(\1)', s)
Will substitute √0.04*2+√4 to sqrt(0.04)*2+sqrt(4)
There's a few things going on here that are different. Firstly, \d, which matches a digit, the same as [0-9]. Secondly, the ( ) capture whatever is inside them. In the replacement, you can refer to these captured groups by the order in which they appear. In the pow example I'm using the first and third group that I have captured.
The prefix r before the replacement string means that the string is to be treated as "raw", so characters are interpreted literally. The groups are accessed by \1, \2 etc. but because the backslash \ is an escape character, I would have to escape it each time (\\1, \\2, etc.) without the r.

Related

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?

How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.

You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'

This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]

your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis

You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?

You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))

How about this ? Example illusrated using a file:
f = open('abc.log','r')
content = f.readlines()
for line in content:
m = re.search(r"\[(.*?)\]", line)
print m.group(1)
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.

This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for similar post, but couldnt find it :(

You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
^ ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.

re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more any characters besides $; this can potentially avoid more pathological backtracking behaviour.

Your regex is fine. re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in m = re.findall(r'$(.*?)$', line):
if m is not None:
print m.group(0)

Splitting a string with delimiters and conditions

I'm trying to split a general string of chemical reactions delimited by whitespace, +, = where there may be an arbitrary number of whitespaces. This is the general case but I also need it to split conditionally on the parentheses characters () when there is a + found inside the ().
For example:
reaction= 'C5H6 + O = NC4H5 + CO + H'
Should be split such that the result is
splitresult=['C5H6','O','NC4H5','CO','H']
This case seems simple when using filter(None,re.split('[\s+=]',reaction)). But now comes the conditional splitting. Some reactions will have a (+M) which I'd also like to split off of as well leaving only the M. In this case, there will always be a +M inside the parentheses
For example:
reaction='C5H5 + H (+M)= C5H6 (+M)'
splitresult=['C5H5','H','M','C5H6','M']
However, there will be some cases where the parentheses will not be delimiters. In these cases, there will not be a +M but something else that doesn't matter.
For example:
reaction='C5H5 + HO2 = C5H5O(2,4) + OH'
splitresult=['C5H5','HO2','C5H5O(2,4)','OH']
My best guess is to use negative lookahead and lookbehind to match the +M but I'm not sure how to incorporate that into the regex expression I used above on the simple case. My intuition is to use something like filter(None,re.split('[(?<=M)\)\((?=\+)=+\s]',reaction)). Any help is much appreciated.

You could use re.findall() instead:
re.findall(pattern, string, flags=0)
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction0)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction1)
re.findall('[A-Z0-9]+(?:\([1-9],[1-9]\))?',reaction2)
but, if you prefer re.split() and filter(), then:
import re
reaction0= 'C5H6 + O = NC4H5 + CO + H'
reaction1='C5H5 + H (+M)= C5H6 (+M)'
reaction2='C5H5 + HO2 = C5H5O(2,4) + OH'
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction0))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction1))
filter(None , re.split('(?<!,[1-9])[\s+=()]+(?![1-9,])',reaction2))
the pattern for findall is different from the pattern for split,
because findall and split are looking for different things; 'the opposite things', indeed.
findall, is looking for that you wanna (keep it).
split, is looking for that you don't wanna (get rid of it).
In findall, '[A-Z0-9]+(?:([1-9],[1-9]))?'
match any upper case or number > [A-Z0-9],
one or more times > +, follow by a pair of numbers, with a comma in the middle, inside of parenthesis > \([1-9],[1-9]\)
(literal parenthesis outside of character classes, must be escaped with backslashes '\'), optionally > ?
\([1-9],[1-9]\) is inside of (?: ), and then,
the ? (which make it optional); ( ), instead of (?: ) works, but, in this case, (?: ) is better; (?: ) is a no capturing group: read about this.
try it with the regex in the split

That seems overly complicated to handle with a single regular expression to split the string. It'd be much easier to handle the special case of (+M) separately:
halfway = re.sub("\(\+M\)", "M", reaction)
result = filter(None, re.split('[\s+=]', halfway))

So here is the regex which you are looking for.
Regex: ((?=\(\+)\()|[\s+=]|((?<=M)\))
Flags used:
g for global search. Or use them as per your situation.
Explanation:
((?=\(\+)\() checks for a ( which is present if (+ is present. This covers the first part of your (+M) problem.
((?<=M)\)) checks for a ) which is present if M is preceded by ). This covers the second part of your (+M) problem.
[\s+=] checks for all the remaining whitespaces, + and =. This covers the last part of your problem.
Note: The care for digits being enclosed by () ensured by both positive lookahead and positive lookbehind assertions.
Check Regex101 demo for working
P.S: Make it suitable for yourself as I am not a python programmer yet.

Is this possible using regular expression

I am using Python 2.7 and I am fairly familiar with using regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with a semicolon. The problem is that data wrapped in double qoutes should retain embedded commas. Here is an example:
Before:
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:
"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
Is there a single regex that can do this?

This is an other way that avoids to test all the string until the end with a lookahead for each occurrence. It's a kind of (more or less) \G feature emulation for re module.
Instead of testing what comes after the comma, this pattern find the item before the comma (and the comma obviously) and is written in a way that makes each whole match consecutive to the precedent.
re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)
online demo
details:
(?: # ensures that results are contiguous
(?<=,) # preceded by a comma (so, the one of the last result)
| # OR
^ # at the start of the string
)
(?= # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
( # capture the precedent item in group 1
"(?:"")*(?:[^"]+(?:"")*)*" # an item between quotes
|
[^",]* # an item without quotes
)
) \1 # back-reference for the capture group 1
,
The advantage of this way is that it reduces the number of steps to obtain a match and provides a near from constant number of steps whatever the item before (see the regex101 debugger). The reason is that all characters are matched/tested only once. So even the pattern is more long, it is more efficient (and the gain grow up in particular with long lines)
The atomic group trick is only here to reduce the number of steps before failing for the last item (that is not followed by a comma).
Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo

# Python 2.7
import re
text = '''
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()
print "Before: " + text
print "After: " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))
This produces the following output:
Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After: "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
You can tinker with this here if you need more customization.

You can use:
>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
RegEx Demo
This will match comma only if it is outside quote by matching even number of quotes after ,.

This regex seems to do the job
,(?=(?:[^"]*"[^"]*")*[^"]*\Z)
Adapted from:
How to match something with regex that is not between two special characters?
And tested with http://pythex.org/

You can split with regex and then join it :
>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'

Allowing escape sequences in my regular expression

I'm trying to create a regular expression which finds occurences of $VAR or ${VAR}. If something like \$VAR or \${VAR} was given, it would not match. If it were given something like \\$VAR or \\${VAR} or any multiple of 2 \'s, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?

The second groupings of double slashes are written as a redundant character class [\\\\]*, matching one or more backslashes, but should be a repeating group ((?:\\\\)*) matching one or more sets of two backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)

To write a regular expression that finds $ unless it is escaped using E unless it in turn is also escaped EE:
import re
values = dict(BLOB='some value')
def repl(m):
return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using E instead of '\\\\' exposes your embed language without thinking about backslashes in Python string literals and regular expression patterns.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python. re.findall and re.sub with '^' - python

Related

how to make a list in python from a string and using regular expression [duplicate]

repetition in regular expression in python

Splitting a string with delimiters and conditions

Is this possible using regular expression

Allowing escape sequences in my regular expression

Categories

Resources