Regular expression: how to match a string containing "\n" (newline)? - python

I'm trying to extract data from a SQL export file with a regular expression. To match the post content field, I use '(?P<content>.*?)'. It works fine most of the time, but if the field contains '\n' the regular expression doesn't match. How can I modify the regular expression to match such fields? Thanks!
Example (I'm using Python):
>>> re.findall("'(?P<content>.*?)'","'<p>something, something else</p>'")
['<p>something, something else</p>']
>>> re.findall("'(?P<content>.*?)'","'<p>something, \n something else</p>'")
[]
P.S. Seemingly every sequence starting with '\' is treated as an escape sequence. How can I tell the regex engine to treat them literally?

You should use DOTALL option:
>>> re.findall("'(?P<content>.*?)'","'<p>something, \n something else</p>'", re.DOTALL)
['<p>something, \n something else</p>']
See this.

You need the Dotall modifier, to make the dot also match newline characters.
re.S
re.DOTALL
Make the '.' special character match any character at
all, including a newline; without this flag, '.' will match anything
except a newline.
See it here on docs.python.org
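As a minimal sketch of the point above (the variable names are illustrative, not from the original posts), the following compares the default behaviour with re.DOTALL and with the equivalent inline flag (?s). It also touches on the P.S.: in a normal string literal "\n" is a real newline character, while in a raw string literal it stays as a backslash followed by n, which the dot matches without any flag.

```python
import re

pattern = r"'(?P<content>.*?)'"

# Normal literal: "\n" becomes a real newline character.
with_newline = "'<p>something, \n something else</p>'"
# Raw literal: backslash and "n" stay as two ordinary characters.
with_backslash_n = r"'<p>something, \n something else</p>'"

# Without DOTALL the dot stops at the newline, so nothing matches.
default_result = re.findall(pattern, with_newline)

# With DOTALL (or the inline flag "(?s)") the dot also matches the newline.
dotall_result = re.findall(pattern, with_newline, re.DOTALL)
inline_result = re.findall("(?s)" + pattern, with_newline)

# A literal backslash + "n" is matched by the dot even without DOTALL.
literal_result = re.findall(pattern, with_backslash_n)

print(default_result, dotall_result, inline_result, literal_result)
```

If the export file really contains a backslash followed by n rather than a newline, the original pattern should already match that field, so it is worth checking which of the two the data holds.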

Related

Regular expressions issue with python due to values with brackets

I have a very large string (300 MB+), and it has some garbage data in it that I need to clean up. I am using Python 2.7 32-bit.
I didn't want to use the string operation replace because the file the user uses is only going to grow over time, so I am trying to use re.sub to replace the value of [lineender] with a newline character like \n or os.linesep.
It seems simple enough to do, so my pattern is:
re.sub('\[lineender]\b', os.linesep, text_value)
This results in only one value being replaced in the whole string, which is wrong.
Sample Data:
s = """A|B|3[lineender]E|F|2M[lineender]"""
Any ideas on how I need to modify my regex to get this working?
I basically need to replace the bracket word with a new line character.
Note that \b in a non-raw string literal is a backspace. If you use a word boundary r'\b', it will require a word char (a letter, digit or an underscore) after ]. In your case, I'd remove \b altogether:
re.sub(r'\[lineender]', os.linesep, text_value)
If you want to make sure there is no word char after ], you may replace \b with \B, but please make sure you are using the r prefix to make your string literal raw.
See Python demo:
import re, os
text_value = """A|B|3[lineender]E|F|2M[lineender]"""
print('"{}"'.format(re.sub(r'\[lineender]', os.linesep, text_value)))
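A small sketch of why the \b matters here, using the sample data from the question. It substitutes "\n" instead of os.linesep purely so the result is the same on every platform; that substitution is an assumption made for the illustration.

```python
import re

text_value = "A|B|3[lineender]E|F|2M[lineender]"

# In a normal string literal "\b" is the backspace character (ASCII 8);
# in a raw string literal it stays as backslash + "b", the regex word boundary.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# With a word boundary after "]", a word character must follow the marker,
# so the marker at the very end of the text is left untouched.
with_boundary = re.sub(r"\[lineender]\b", "\n", text_value)

# Without the boundary every marker is replaced.
without_boundary = re.sub(r"\[lineender]", "\n", text_value)

print(repr(with_boundary))
print(repr(without_boundary))
```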
You need to pass the pattern as a raw string:
re.sub(r'\[lineender\]\b', os.linesep, text_value)
alternatively, you'll have to use \\ (double backslashes):
re.sub('\\[lineender\\]\\b', os.linesep, text_value)

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
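A quick check of the two building blocks just described, as a sketch with made-up test strings: * repeats whatever comes before it, and . stands for a single character.

```python
import re

# "b*" means zero or more "b" characters after the "a".
assert re.fullmatch(r"ab*", "a")
assert re.fullmatch(r"ab*", "abbb")
assert re.fullmatch(r"ab*", "abab") is None

# ".*" means zero or more of any character (newlines excluded by default).
assert re.fullmatch(r".*", "anything at all")
assert re.fullmatch(r".*", "")

# A "*" with nothing in front of it has nothing to repeat,
# so the regex compiler rejects it.
try:
    re.compile("*")
    star_alone_is_error = False
except re.error:
    star_alone_is_error = True

print(star_alone_is_error)
```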
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re

def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
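If only the text between the two markers is wanted, a capturing group does the job. The sketch below is a variant of the function above; the name find_inner and the shortened sample text are made up for illustration. It uses a non-greedy .*? so the match stops at the first occurrence of the suffix.

```python
import re

# Shortened version of the file contents from the question.
sample = """sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
xyz
"""

def find_inner(prefix, suffix, text):
    """Return the text between prefix and suffix, or None if not found."""
    pattern = "{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

inner = find_inner("_abc:", "xyz", sample)
print(repr(inner))

# Without re.DOTALL the dot cannot cross the line breaks, so nothing is found.
no_flag = re.search(r"_abc:(.*?)xyz", sample)
print(no_flag)
```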

C Preprocessing with Python Regular Expressions

I've never used regular expressions before and I'm struggling to make sense of them. I have strings in the form of 'define(__arch64__)' and I just want the __arch64__.
import re
mystring = 'define(this_symbol)||define(that_symbol)'
pattern = 'define\(([a-zA-Z_]\w*)\)'
re.search(mystring, pattern).groups()
(None, None)
Why doesn't search return 'this_symbol' and 'that_symbol'?
You have the parameters of search() in the wrong order, it should be:
re.search(pattern, mystring)
Also, backslashes are escape characters in Python strings (for example "\n" will be a string containing a newline). If you want literal backslashes, like in the regular expression, you have to escape them with another backslash. Alternatively you can use raw strings, which are marked by an r in front of them and don't treat backslashes as escape characters:
pattern = r'define\(([a-zA-Z_]\w*)\)'
You must differentiate between the symbol ( and the regexp group characters. Also, the pattern goes first in re.search:
pattern = 'define\\(([a-zA-Z_]\\w*)\\)'
re.search(pattern, mystring).groups()
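Note that re.search stops at the first match, so with the corrected argument order it yields only this_symbol. A minimal sketch of getting both symbols with re.findall, reusing the names from the question:

```python
import re

mystring = 'define(this_symbol)||define(that_symbol)'
pattern = r'define\(([a-zA-Z_]\w*)\)'

# search() returns only the first match in the string.
first = re.search(pattern, mystring).group(1)

# findall() returns the captured group of every match.
all_symbols = re.findall(pattern, mystring)

print(first)
print(all_symbols)
```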

Need regular expression expert: round bracket within stringliteral

I'm searching for strings within strings using Regex. The pattern is a string literal that ends in (, e.g.
# pattern
" before the bracket ("
# string
this text is before the bracket (and this text is inside) and this text is after the bracket
I know the pattern will work if I escape the character with a backslash, i.e.:
# pattern
" before the bracket \\("
But the pattern strings are coming from another search and I can not control what characters will be or where. Is there a way of escaping an entire string literal so that anything between markers is treated as a string? For example:
# pattern
\" before the ("
The only other option I have is to do a substitute adding escapes for every protected character.
re.escape is exactly what I need. I'm using regexp in Access VBA, which doesn't have that method. I only have the replace, execute and test methods.
Is there a way to escape everything within a string in VBA?
Thanks
You didn't specify the language, but it looks like Python, so if you have a string in Python whose special regex characters you need to escape, use re.escape():
>>> import re
>>> re.escape("Wow. This (really) is *cool*")
'Wow\\.\\ This\\ \\(really\\)\\ is\\ \\*cool\\*'
Note that spaces are escaped, too (probably to ensure that they still work in a re.VERBOSE regex).
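For the Python case, here is a short sketch of how the escaped string would be used with the pattern and text from the question. The variable names are illustrative.

```python
import re

literal = " before the bracket ("
text = ("this text is before the bracket (and this text is inside) "
        "and this text is after the bracket")

# Used as-is, the trailing "(" opens a group that is never closed,
# so the pattern does not even compile.
try:
    re.compile(literal)
    unescaped_is_error = False
except re.error:
    unescaped_is_error = True

# After re.escape, every special character is matched literally.
match = re.search(re.escape(literal), text)

print(unescaped_is_error)
print(repr(match.group()))
```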
Maybe write your own VBA escape function:
Function EscapeRegEx(text As String) As String
    Dim regEx As RegExp
    Set regEx = New RegExp
    regEx.Global = True
    regEx.Pattern = "(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})"
    EscapeRegEx = regEx.Replace(text, "\$1")
End Function
I'm pretty sure that with the limitations of the RegExp abilities in VBA/VBScript, you are going to have to replace the special characters in your pattern before using it. There doesn't seem to be anything built into it like there is in Python.
The following regex will capture everything from the beginning of the string to the first (. The first captured group $1 will contain the portion before (.
^([^(]+)\(
Depending on your language, you might have to escape it as:
"^([^(]+)\\("

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to leave the backslashes alone, so they reach the regex engine intact.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
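To make the comparison concrete, here is a minimal sketch of both approaches on the sample URL. The capturing-group pattern is one possible way to write it, not necessarily the one the other answer uses.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Lookbehind: the match itself is only the text after "(K(".
lookbehind = re.search(r"(?<=\(K\()[^)]*", url).group()

# Capturing group: the whole "(K(...)" is matched,
# and group 1 holds just the inner part.
captured = re.search(r"\(K\(([^)]*)\)", url).group(1)

print(lookbehind)
print(captured)
```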
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to match special characters literally in a regex, you need to escape them, such as \( or \\. (A forward slash is not special in Python's re, so escaping it as \/ is optional.)
Matching things inside nested parentheses is quite a pain in regex. If that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, group characters until the next close paren, then make sure another close paren follows somewhere after it.
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
print(re.sub(r'^.*\((\w+)\).*', r'\1', mystr))
