Parse a string from a string with regex - python

I need a regex that will parse a string from a string.
To show you what I mean, imagine that the following is the content of the string to parse:
"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\"
where "..." denotes some arbitrary data.
The regex would need to return the list:
["a string", "another \"string\"\\", "yet another \"string"]
Edit: Note that the literal backslashes don't stop the second match
I've tried finditer but it won't find overlapping matches, and I tried the lookahead (?=) but I couldn't get that to work either.
Help?

You could try the regex below, which matches a string that starts with a " not preceded by a \ and runs up to the next " that is also not preceded by a \:
(?<!\\)".*?(?<!\\)"
>>> s = r'"a string" ... "another \"string\"" ... "yet another \"string" ... "failed string\"'
>>> m = re.findall(r'".*?[^\\]"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
>>> m = re.findall(r'".*?(?<!\\)"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
>>> m = re.findall(r'(?<!\\)".*?(?<!\\)"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
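A lookbehind such as (?<!\\) inspects only one character, so it also rejects a closing quote that follows an escaped backslash (\\"). The sketch below uses a made-up sample string to contrast the lookbehind pattern with one that consumes every backslash together with the character it escapes. It assumes each match should start at the first quote the scanner reaches; it does not check whether that opening quote is itself escaped.

```python
import re

# Hypothetical sample: the first string ends with an escaped backslash.
sample = r'"ends with backslash\\" ... "next"'

# Lookbehind version: the quote after \\ is wrongly treated as escaped,
# so the match runs on to the following quote.
lookbehind = re.findall(r'(?<!\\)".*?(?<!\\)"', sample)

# Escape-aware version: a backslash always consumes the next character,
# so \\ is read as one escaped backslash and the quote after it closes the string.
escape_aware = re.findall(r'"(?:[^"\\]|\\.)*"', sample)

print(lookbehind)
print(escape_aware)
```

The escape-aware form needs no special case for a trailing \\, because escapes are paired off from left to right inside the string.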
UPDATE:
>>> s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" '
>>> m = re.findall(r'(?<!\\)".*?(?<!\\)"|(?<=\\\\)".*?\\\\"', s)
>>> m
['"a string"', '"another \\"string\\"\\\\"', '"yet another \\"string"']
>>> for i in m:
... print i
...
"a string"
"another \"string\"\\"
"yet another \"string"

One way is to emulate an atomic group, which is useful for reducing backtracking when the pattern must fail:
re.findall(r'"(?=((?:[^"\\]+|\\.)*))\1"', s)
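As a rough illustration, here is that pattern applied to the sample input from the question. Note that findall() returns only the capture group (the string bodies without quotes), so the sketch also uses finditer() with group(0) to get the quoted form; the variable names are illustrative.

```python
import re

# The sample input from the question, including the unterminated last string.
s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\"'

# The lookahead captures the string body into group 1; \1 then consumes it.
# A lookahead is not re-entered on backtracking, so the body is matched atomically.
pattern = re.compile(r'"(?=((?:[^"\\]+|\\.)*))\1"')

# findall() returns only group 1 (the bodies); group(0) keeps the quotes.
quoted = [m.group(0) for m in pattern.finditer(s)]
bodies = pattern.findall(s)

for item in quoted:
    print(item)
```

On Python 3.11 and later the re module supports atomic groups directly with (?>...), so the lookahead-plus-backreference trick is mainly useful on older versions.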

You can use this regex:
"[\w\s\\"]+(?<!\\)"
Edit: I noticed you updated your input sample. For the updated input, you can use this regex:
(?:\\\\"|")[\w\s\\"]+(?:\\\\"|(?<!\\)")

("[^...]*?")(?=\s*\.\.\.|$)
You can try this.
See the demo; it gives the required answer:
http://regex101.com/r/bJ6rZ5/4

Related

Python re regex sub letters not surrounded in quotes and not if they match specific word including regex group / match

I need to sub letters not surrounded in quotes and not if they match the word TODAY with a particular string where a part of it includes the match group e.g.
import re
import string
s = 'AB+B+" HELLO"+TODAY()/C* 100'
x = re.sub(r'\"[^"]*\"|\bTODAY\b|([A-Z]+)', r'a2num("\g<0>")', s)
print (x)
expected output:
'a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100'
actual output:
'a2num("AB")+a2num("B")+a2num("" HELLO"")+a2num("TODAY")()/a2num("C")* 100'
I am nearly there, but it is not obeying the quote rule or the TODAY rule. I know the string doesn't make any sense; it's just a harsh test of the regex.
Your regex approach is correct but you need to use a lambda function in re.sub
>>> s = 'AB+B+" HELLO"+TODAY()/C* 100'
>>> rs = re.sub(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b',
... lambda m: 'a2num("' + m.group(1) + '")' if m.group(1) else m.group(), s)
>>> print (rs)
a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100
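The original attempt fails because a replacement template such as a2num("\g<0>") is applied to every match, including the quoted text and TODAY. A callable can decide per match. The sketch below is the same idea as the lambda, written as a named function; wrap_names and replace are illustrative names, not part of any library.

```python
import re

def wrap_names(expr):
    """Wrap bare uppercase names in a2num("..."), leaving quoted text and TODAY alone."""
    pattern = re.compile(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b')

    def replace(match):
        # Group 1 is set only when the third alternative matched.
        if match.group(1):
            return 'a2num("' + match.group(1) + '")'
        # Quoted text or TODAY: put the matched text back unchanged.
        return match.group(0)

    return pattern.sub(replace, expr)

result = wrap_names('AB+B+" HELLO"+TODAY()/C* 100')
print(result)
```

The order of the alternatives matters: quoted text and TODAY are tried first, so they are consumed before the [A-Z]+ branch can see them.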

Python re.search regex using or condition

I'm trying to include an "or" (|) condition in a search pattern with re.search. Can someone show me how to do this? I'm not getting a match.
Below code works
>>> pattern = re.escape('apple.fruit[0]')
>>> sig = 'apple.fruit[0]'
>>> if re.search(pattern, sig):
... print("matched")
...
matched
>>> pattern = re.escape('apple.fruit[0] or vegi[0]')
>>> if re.search(pattern, sig):
... print("matched")
...
>>>
I want to match above string "apple." followed by fruit[0] or vegi[0]
A regex "or" is written with the | operator, and it must not be included inside re.escape. If it is, it loses its special meaning.
pattern = re.escape('apple.fruit[0]')+ '|' + re.escape('vegi[0]')
or
pattern = r'apple\.fruit\[0\]|vegi\[0\]'
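Note that both patterns above match either "apple.fruit[0]" or a bare "vegi[0]". If the intent is "apple." followed by either suffix, as the question states, the alternation can be wrapped in a group. A minimal sketch, with illustrative variable names and two extra test inputs that are not from the question:

```python
import re

# Escape the literal pieces, then join the alternatives inside a non-capturing group.
prefix = re.escape('apple.')
suffixes = ['fruit[0]', 'vegi[0]']
pattern = prefix + '(?:' + '|'.join(re.escape(x) for x in suffixes) + ')'

# 'apple.meat[0]' and a bare 'vegi[0]' are hypothetical inputs that should not match.
for sig in ['apple.fruit[0]', 'apple.vegi[0]', 'apple.meat[0]', 'vegi[0]']:
    print(sig, bool(re.search(pattern, sig)))
```

Without the group, | has the lowest precedence in the pattern, so it splits the whole expression into two alternatives.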

Why can’t I get rid of the L with this python regular expression?

I’m trying to get rid of the Ls at the ends of integers with a regular expression in python:
import re
s = '3535L sadf ddsf df 23L 2323L'
s = re.sub(r'\w(\d+)L\w', '\1', s)
However, this regex doesn't even change the string. I've also tried s = re.sub(r'\w\d+(L)\w', '', s) since I thought that maybe the L could be captured and deleted, but that didn't work either.
I'm not sure what you're trying to do with those \ws in the first place, but to match a string of digits followed by an L, just use \d+L, and to remove the L you just need to put the \d+ part in a capture group so you can sub it for the whole thing:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> re.sub(r'(\d+)L', r'\1', s)
'3535 sadf ddsf df 23 2323'
Here's the regex in action:
(\d+)L
Of course this will also convert, e.g., 123LBQ into 123BQ, but I don't see anything in your examples or in your description of the problem that indicates that this is possible, or which possible result you want for that, so…
\w = [a-zA-Z0-9_]
In other words, \w does not include whitespace characters. Each L is at the end of the word and therefore doesn't have any "word characters" following it. Perhaps you were looking for word boundaries?
re.sub(r'\b(\d+)L\b', r'\1', s)
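Two details are easy to miss here: the replacement needs to be a raw string, and the trailing \b keeps tokens such as 123LBQ intact. A short sketch, using the question's string with 123LBQ appended for illustration:

```python
import re

s = '3535L sadf ddsf df 23L 2323L 123LBQ'

# In a normal string literal, '\1' is the control character chr(1),
# not a reference to group 1.
assert '\1' == '\x01'

# With a raw string the replacement refers to the captured digits.
# The trailing \b means an L followed by more word characters is left alone.
cleaned = re.sub(r'\b(\d+)L\b', r'\1', s)
print(cleaned)
```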
You can use a lookbehind assertion:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> s = re.sub(r'(?<=\d)L\b', '', s)
>>> s
'3535 sadf ddsf df 23 2323'
(?<=\d)L matches an L only when it is preceded by a digit; the L is then replaced with the empty string ''.
Try this:
re.sub(r'(?<=\d)L', '', s)
This uses a lookbehind to find an "L" that is preceded by a digit. The pattern has no capture group, so the replacement is the empty string.
Why not use a - IMO more readable - generator expression?
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> ' '.join(x.rstrip('L') if x[-1:] =='L' and x[:-1].isdigit() else x for x in s.split())
'3535 sadf ddsf df 23 2323'

Splitting a string by using two substrings in Python

I am searching a string using re, which works for almost all cases except when there is a newline character (\n).
For instance if string is defined as:
testStr = " Test to see\n\nThis one print\n "
Then searching like this re.search('Test(.*)print', testStr) does not return anything.
What is the problem here? How can I fix it?
The re module has re.DOTALL to indicate "." should also match newlines. Normally "." matches anything except a newline.
re.search('Test(.*)print', testStr, re.DOTALL)
Alternatively:
re.search('Test((?:.|\n)*)print', testStr)
# (?:…) is a non-capturing group, used here so that * applies to the whole alternation
Example:
>>> testStr = " Test to see\n\nThis one print\n "
>>> m = re.search('Test(.*)print', testStr, re.DOTALL)
>>> print m
<_sre.SRE_Match object at 0x1706300>
>>> m.group(1)
' to see\n\nThis one '
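For reference, a Python 3 sketch of the same check, showing the match failing without the flag and succeeding with it. The inline (?s) form is an equivalent way to switch the flag on from inside the pattern; the variable names are illustrative.

```python
import re

test_str = " Test to see\n\nThis one print\n "

# Without DOTALL the dot stops at the first newline, so nothing matches.
no_flag = re.search(r'Test(.*)print', test_str)

# DOTALL (alias re.S) lets the dot match newlines as well.
with_flag = re.search(r'Test(.*)print', test_str, re.DOTALL)

# The same flag written inline at the start of the pattern.
inline = re.search(r'(?s)Test(.*)print', test_str)

print(no_flag)
print(repr(with_flag.group(1)))
```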

Period stops multiline regex substitute in Python?

I have a multi-line string that I'd like to replace part of, but I don't understand why it's not working. For some reason, a period in the string stops the regular expression from matching.
My string:
s = """
[some_previous_text]
<start>
one_period .
<end>
[some_text_after]
"""
What I'd like to end up with:
s = """
[some_previous_text]
foo
[some_text_after]
"""
What I initially tried, but it doesn't match anything:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
<start>
one_period .
<end>
However, when I took the period out, it worked fine:
>>> import re
>>> s = "<start>\nno_period\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
foo
Also, when I put an <end> tag before the period, it matched the first <end> tag:
>>> import re
>>> s = "<start>\n<end>\none_period .\n<end>"
>>> print re.sub("<start>[^.]*<end>", "foo", s)
foo
one_period .
<end>
So what's going on here? Why does the period stop the [^.]* matching?
EDIT:
SOLVED
I mistakenly thought that the caret ^ was for new-line matching. What I needed was the re.DOTALL flag (as indicated by Amber). Here's the expression I'm now using:
>>> import re
>>> s = "<start>\none_period .\n<end>"
>>> print re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)
foo
Why wouldn't it? [^.] is "the set of all characters that are not a ." and thus doesn't match periods.
Perhaps you instead meant to just put .* (any number of any characters) instead of [^.]*?
For matching across newlines, specify re.DOTALL:
re.sub("<start>.*<end>", "foo", s, flags=re.DOTALL)
That's because [^.]* is a negated character class that matches any character except a period.
You probably want something like <start>.*?<end> together with the re.S modifier, which makes the dot match newline characters as well.
re.sub("<start>.*?<end>", "foo", s, flags=re.S)
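The difference between .* and .*? only shows up when the text contains more than one tagged block. A small sketch, using a hypothetical input with two blocks and a line between them:

```python
import re

# Hypothetical input with two tagged blocks and text in between.
s = "<start>\none .\n<end>\nkeep me\n<start>\ntwo\n<end>"

# Greedy: .* runs to the last <end>, swallowing "keep me".
greedy = re.sub(r"<start>.*<end>", "foo", s, flags=re.DOTALL)

# Lazy: .*? stops at the first <end> after each <start>.
lazy = re.sub(r"<start>.*?<end>", "foo", s, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```

With a single block, as in the question, both forms give the same result.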
