Python regular expression: multiple expressions within an expression

My Python script needs an expression like
"\[.*[ERROR].*\n.*\n.*\n.*/\n.*is for multiple time/[\]]{2}"
Please let me know how to repeat "\n." multiple times. This is where I'm stuck.

There is the MULTILINE flag, which makes '^' and '$' match at every line boundary instead of only at the ends of the string.
https://docs.python.org/2/library/re.html#re.MULTILINE
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
You also have access to DOTALL, which makes . match newlines as well:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Depending on your match, those two flags let you choose how newlines are handled. In your case, you probably want to adjust your pattern like this:
import re
text = '\n[ [ERROR]\n\nsome text\nis for multiple time]'
re.findall(r"\[.*\[ERROR\].*is for multiple time\]", text, re.DOTALL)
# result: ['[ [ERROR]\n\nsome text\nis for multiple time]']
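To make the difference between the two flags concrete, here is a small sketch on a made-up four-line string (the sample text and variable names are assumptions, not taken from the question). The last pattern also shows one way to repeat "\n.*" a fixed number of times with a non-capturing group; that is an alternative to DOTALL, not part of the answer above.

```python
import re

sample = "first line\nsecond line\nthird line\nfourth line"

# MULTILINE only changes where ^ and $ are allowed to match.
starts_default = re.findall(r"^\w+", sample)
starts_multiline = re.findall(r"^\w+", sample, re.MULTILINE)

# DOTALL lets . cross line breaks.
span_default = re.findall(r"first.*second", sample)
span_dotall = re.findall(r"first.*second", sample, re.DOTALL)

# Without any flag, "\n.*" can be repeated explicitly: here exactly two more lines.
three_lines = re.search(r"first.*(?:\n.*){2}", sample).group()

print(starts_default)     # ['first']
print(starts_multiline)   # ['first', 'second', 'third', 'fourth']
print(span_default)       # []
print(span_dotall)        # ['first line\nsecond']
print(repr(three_lines))  # 'first line\nsecond line\nthird line'
```

Replacing {2} with * or + in the last pattern would repeat "\n.*" an arbitrary number of times.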

Related

How to remove a word if it has more than 2 occurrence of a given character in python?

I am parsing a log file which has lines like:
Pushing the logs into /var/log/my_log.txt
Pushing the logs into /opt/test/log_file.txt
There are multiple occurrences of these lines with auto-generated paths(/.../.../...)
I want to change this into a generic form like:
Pushing the logs into PATH
I tried using regex to select a word with multiple forward slashes and then replace it with the word 'PATH' as follows:
line = re.sub(r'\b([\/A-Z]*\/[A-Z]*){1,}\b',' PATH ',line)
Only the forward slashes are getting replaced but not the entire word.
Very new to this concept. Am I doing something wrong? All help is appreciated. Thanks.
You could use:
import re
line = 'Pushing the logs into /var/log/my_log.txt'
pat = r'(?<!\S)(/\S+){2,}'
line = re.sub(pat, 'PATH', line)
print(line)
This does not answer the question exactly as stated, because it looks for "words" that start with a / and contain two or more / in total (with other non-whitespace characters following each /), so it also covers e.g. /tmp/my_log.txt. I think this better matches the sort of strings you will actually find: an absolute path always starts with /, and a file (rather than a directory) never has its last / at the end (although I haven't bothered to exclude a trailing / provided that there are at least two / before it). If you only want to look for e.g. 3 or more / (not at the end), then change the 2 to a 3, but you will miss /tmp/my_log.txt if you do that.
The first bit of the regexp (?<!\S) is a negative lookbehind assertion meaning "not preceded by a non-whitespace character", i.e. it will match at the start of a "word" or the start of the line. The next bit (/\S+) means a / followed by one or more non-whitespace characters (which could include / -- it doesn't matter so I haven't bothered to exclude these). And the {2,} means that there should be two or more of these.
(I am using "word" here as in the question, to refer to sequence of non-whitespace characters, not necessarily letters.)
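As a quick check of the behaviour described above, the following sketch runs the same pattern over a few made-up lines (the helper name generalize and all sample lines except the first are assumptions for illustration).

```python
import re

pat = r'(?<!\S)(/\S+){2,}'

def generalize(line):
    """Replace every absolute path that contains at least two slashes with PATH."""
    return re.sub(pat, 'PATH', line)

two_dirs = generalize('Pushing the logs into /var/log/my_log.txt')
one_dir = generalize('Pushing the logs into /tmp/my_log.txt')
single_slash = generalize('Listing /tmp now')
relative = generalize('Relative path logs/app/out.txt stays')

print(two_dirs)      # Pushing the logs into PATH
print(one_dir)       # Pushing the logs into PATH
print(single_slash)  # Listing /tmp now
print(relative)      # Relative path logs/app/out.txt stays
```

The last two lines stay unchanged: /tmp has only one slash, and logs/app/out.txt does not start with a slash at a whitespace boundary.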
Only the forward slashes are matched because the string is lowercase, while the character class [\/A-Z]* in the pattern matches only forward slashes and the uppercase characters A-Z (zero or more times).
You could make the pattern case insensitive using re.IGNORECASE but it will not match the underscore and the dot in the example data.
The first forward slash does not get matched as you start the pattern with a word boundary \b, but there is no word boundary between the space and the first forward slash.
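The two causes named above can be checked in isolation. This is only a diagnostic sketch; the short test strings are made up, and the final line simply reruns the pattern from the question so the partial replacement can be inspected.

```python
import re

line = 'Pushing the logs into /var/log/my_log.txt'

# \b needs a word character on exactly one side of the position.
# Between a space and "/" there is none, so the first slash cannot start a match.
no_boundary = re.search(r'\b/', ' /var')
boundary = re.search(r'\b/', 'r/log')

# [A-Z] does not match the lowercase letters of the path components.
lower_path = re.fullmatch(r'[\/A-Z]*', '/var')
upper_path = re.fullmatch(r'[\/A-Z]*', '/VAR')

# The pattern from the question: the path is not replaced as a whole.
attempt = re.sub(r'\b([\/A-Z]*\/[A-Z]*){1,}\b', ' PATH ', line)

print(no_boundary is None, boundary is not None)   # True True
print(lower_path is None, upper_path is not None)  # True True
print(attempt)
```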
A bit more specific match could be using \w to match a word character and specify the dot for the extension:
(?<!\S)(?:/\w+)+/\w+\.\w+(?!\S)
(?<!\S) Assert a whitespace boundary to the left
(?:/\w+)+ Match 1 or more times a / followed by 1+ word chars
/\w+\.\w+ Match the last / followed by a filename format using the dot and word chars
(?!\S) Assert a whitespace boundary to the right
import re
line = 'Pushing the logs into /var/log/my_log.txt'
line = re.sub(r'(?<!\S)(?:/\w+)+/\w+\.\w+(?!\S)', 'PATH', line)
print(line)
Output
Pushing the logs into PATH
A broader pattern could be matching 2 times the forward slash and use a negated character class to match any char except a forward slash or a newline
(?<!\S)(?:/[^/\r\n]+){2,}
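To compare the more specific pattern with the broader one, here is a sketch on three sample lines (only the first line comes from the question; the other two are assumptions chosen to show where the patterns differ).

```python
import re

specific = r'(?<!\S)(?:/\w+)+/\w+\.\w+(?!\S)'
broad = r'(?<!\S)(?:/[^/\r\n]+){2,}'

with_ext = 'Pushing the logs into /var/log/my_log.txt'
no_ext = 'Pushing the logs into /var/log/app'
trailing_text = 'Moved /var/log/app to backup'

# Both patterns handle a path that ends in name.extension.
print(re.sub(specific, 'PATH', with_ext))   # Pushing the logs into PATH
print(re.sub(broad, 'PATH', with_ext))      # Pushing the logs into PATH

# Only the broad pattern accepts a path without an extension.
print(re.sub(specific, 'PATH', no_ext))     # Pushing the logs into /var/log/app
print(re.sub(broad, 'PATH', no_ext))        # Pushing the logs into PATH

# Caveat: [^/\r\n] also matches spaces, so the broad pattern
# consumes the text that follows the path on the same line.
print(re.sub(broad, 'PATH', trailing_text)) # Moved PATH
```

If text can follow the path on the same line, excluding whitespace from the negated class (for example [^/\s]) is one possible adjustment.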

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried:
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc:(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
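You called re.escape on pieces of your pattern that contain special characters, so those characters are matched literally and the pattern cannot work as intended.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want. (The exact escaping varies with the Python version; from 3.7 on the colon is left unescaped.)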
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
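A short sketch of what * and .* actually do, using made-up test strings (the variable names are assumptions for illustration):

```python
import re

# "ab*" means: an "a", then zero or more "b" characters.
star_examples = {s: bool(re.fullmatch(r"ab*", s)) for s in ["a", "abbb", "abab"]}

# ".*" means: zero or more of any character except a newline.
any_single_line = bool(re.fullmatch(r".*", "anything at all"))
any_two_lines = bool(re.fullmatch(r".*", "two\nlines"))
any_two_lines_dotall = bool(re.fullmatch(r".*", "two\nlines", re.DOTALL))

print(star_examples)         # {'a': True, 'abbb': True, 'abab': False}
print(any_single_line)       # True
print(any_two_lines)         # False
print(any_two_lines_dotall)  # True
```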
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
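If only the text between the two markers is wanted, a capturing group can be added. The following is a sketch of that variant, not the answer's own code: the function name text_between is hypothetical, the sample is a shortened version of the file from the question, and the lazy .*? is an assumption that the match should stop at the first suffix rather than the last.

```python
import re

def text_between(prefix, suffix, text):
    """Return the text between the first prefix and the next suffix, or None."""
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz\n"

found = text_between("_abc:", "xyz", sample)
missing = text_between("_abc:", "no such marker", sample)

print(repr(found))  # '\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\n'
print(missing)      # None
```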

Python regex with *?

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.
* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but they don't bring up how much it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . then matches line break characters as well. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
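To see both points at once (skipping lines that end in a backslash, and the effect of re.DOTALL), here is a sketch on a made-up string whose second line ends with a backslash. The sample string is an assumption, not from the question.

```python
import re

# Three lines; the middle one ends with a backslash (a "continued" line).
s = "foo\nbar\\\nbaz\n"

# Without DOTALL: only lines that do not end in a backslash are found.
lazy_default = re.findall(r'.*?[^\\]\n', s)

# With DOTALL, the greedy version runs to the last newline
# that is not preceded by a backslash.
greedy_dotall = re.findall(r'.*[^\\]\n', s, re.DOTALL)

# With DOTALL, the lazy version stops at the first such newline each time.
lazy_dotall = re.findall(r'.*?[^\\]\n', s, re.DOTALL)

print(lazy_default)   # ['foo\n', 'baz\n']
print(greedy_dotall)  # ['foo\nbar\\\nbaz\n']
print(lazy_dotall)    # ['foo\n', 'bar\\\nbaz\n']
```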
. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? after a quantifier makes that quantifier lazy: it matches as few characters as possible and stops at the first point where the rest of the pattern can match.
Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
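The documentation's example can be reproduced directly. The sketch below also adds +? and ?? on a made-up string of three a characters to show that the same rule applies to them.

```python
import re

html = '<H1>title</H1>'

greedy = re.match(r'<.*>', html).group()
lazy = re.match(r'<.*?>', html).group()

# The same rule applies to + and ?: the lazy forms take as little as they can.
plus_greedy = re.match(r'a+', 'aaa').group()
plus_lazy = re.match(r'a+?', 'aaa').group()
optional_lazy = re.match(r'a??', 'aaa').group()

print(greedy)               # <H1>title</H1>
print(lazy)                 # <H1>
print(plus_greedy)          # aaa
print(plus_lazy)            # a
print(repr(optional_lazy))  # ''
```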

python regular expression matching anything

My regular expression isn't doing anything to my string.
data = 'random\n<article stuff\n</article>random stuff'
datareg = re.sub(r'.*<article(.*)</article>.*', r'<article\1</article>', data, flags=re.MULTILINE)
print datareg
I get:
random
<article stuff
</article>random stuff
I want:
<article stuff
</article>
re.MULTILINE doesn't actually make your regex multiline in the way you want it to be.
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
re.DOTALL does:
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Change flags=re.MULTILINE to flags=re.DOTALL and your regex will work.
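For reference, here is the question's code with that one change applied, written for Python 3. The final re.search line is an alternative sketch that extracts the element directly instead of substituting around it; it is not part of the answer above.

```python
import re

data = 'random\n<article stuff\n</article>random stuff'
pattern = r'.*<article(.*)</article>.*'

# MULTILINE does not let . cross line breaks, so nothing matches.
with_multiline = re.sub(pattern, r'<article\1</article>', data, flags=re.MULTILINE)

# DOTALL does, so the whole string is replaced by the article element.
with_dotall = re.sub(pattern, r'<article\1</article>', data, flags=re.DOTALL)

# Alternative: search for the element itself with a lazy quantifier.
extracted = re.search(r'<article.*?</article>', data, re.DOTALL).group()

print(with_multiline == data)  # True
print(repr(with_dotall))       # '<article stuff\n</article>'
print(repr(extracted))         # '<article stuff\n</article>'
```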

match part of a string until it reaches the end of the line (python regex)

If I have a large string with multiple lines and I want to match part of a line only to end of that line, what is the best way to do that?
So, for example I have something like this and I want it to stop matching when it reaches the new line character.
r"(?P<name>[A-Za-z\s.]+)"
I saw this in a previous answer:
$ - indicates matching to the end of the string, or end of a line if
multiline is enabled.
My question is then how do you "enable multiline" as the author of that answer states?
Simply use
r"(?P<name>[A-Za-z\t .]+)"
This will match ASCII letters, spaces, tabs or periods. It stops at the first character that is not in the character class, and newlines aren't (whereas \s does include them). Because the class itself excludes newlines, it is irrelevant whether multiline mode is turned on or off.
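A minimal comparison of the two character classes, using a made-up two-line string (the names are assumptions for illustration):

```python
import re

text = "John Smith\nJane Doe"

# \s includes the newline, so the match runs on into the second line.
with_s = re.match(r"(?P<name>[A-Za-z\s.]+)", text).group("name")

# Listing tab and space explicitly leaves the newline out of the class.
explicit = re.match(r"(?P<name>[A-Za-z\t .]+)", text).group("name")

print(repr(with_s))    # 'John Smith\nJane Doe'
print(repr(explicit))  # 'John Smith'
```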
You can enable multiline matching by passing re.MULTILINE as the second argument to re.compile(). However, there is a subtlety to watch out for: since the + quantifier is greedy, this regular expression will match as long a string as possible, so if the next line is made up of letters and whitespace, the regex might match more than one line (in multiline mode $ matches at the end of any line, so a later line end satisfies it too).
There are three solutions to this:
Change your regex so that, instead of matching any whitespace including newline (\s) your repeated character set does not match that newline.
Change the quantifier to +?, the non-greedy ("minimal") version of +, so that it will match as short a string as possible and therefore stop at the first newline.
Change your code to first split the text up into an individual string for each line (using text.split('\n')).
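The three options can be sketched as follows. The sample text is made up, and each pattern adds a trailing $ to the expression from the question, which is an assumption about how the anchor would be used.

```python
import re

text = "John Smith\nJane Doe"

# The problem: greedy + with \s runs across the line break up to the last $.
problem = re.compile(r"(?P<name>[A-Za-z\s.]+)$", re.MULTILINE)
too_much = problem.search(text).group("name")

# 1. A character class that does not contain the newline.
no_newline = re.compile(r"(?P<name>[A-Za-z\t .]+)$", re.MULTILINE)
first = no_newline.search(text).group("name")

# 2. The non-greedy +?, which stops at the first line end where $ matches.
lazy = re.compile(r"(?P<name>[A-Za-z\s.]+?)$", re.MULTILINE)
second = lazy.search(text).group("name")

# 3. Split into lines first, then match each line on its own (no flag needed).
per_line = re.compile(r"(?P<name>[A-Za-z\s.]+)$")
third = [per_line.match(line).group("name") for line in text.split('\n')]

print(repr(too_much))  # 'John Smith\nJane Doe'
print(repr(first))     # 'John Smith'
print(repr(second))    # 'John Smith'
print(third)           # ['John Smith', 'Jane Doe']
```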
Look at the flags parameter at http://docs.python.org/library/re.html#module-contents
