I'm trying to define a regex across several lines using re.VERBOSE, but Python is adding a newline symbol. For example:
When not using multiline:
In [1]: pat = re.compile(r'''(?P<host>(\d{1,3}\.){3}\d{1,3})( - )(?P<user_name>(\w+|-)).''')
...: pat
re.compile(r'(?P<host>(\d{1,3}\.){3}\d{1,3})( - )(?P<user_name>(\w+|-)).',re.UNICODE)
But when I try to define it across multiple lines:
In [2]: pat = re.compile(r'''\
...: (?P<host>(\d{1,3}\.){3}\d{1,3})\
...: ( - )(?P<user_name>(\w+|-)).''', re.MULTILINE|re.VERBOSE)
In [4]: pat
re.compile(r'\\n(?P<host>(\d{1,3}\.){3}\d{1,3})\\n( - )(?P<user_name>(\w+|-)).',
re.MULTILINE|re.UNICODE|re.VERBOSE)
I keep getting a \n where the next part of the regex is defined, but it shouldn't be there.
How am I supposed to define a multiline regex?
There's no inherent problem with having newlines in your regex when you use the re.VERBOSE flag, as whitespace is ignored, with an important caveat:
Whitespace within the pattern is ignored, except when in a character
class, or when preceded by an unescaped backslash
Your first problem is that you are adding an unnecessary \ to the end of each of the lines in your regex, and they are then appearing in the regex, making the newlines preceded by an unescaped backslash and thus required for a match. Consider this trivial example:
pat = re.compile(r'''
\d+
-
\d+''', re.VERBOSE)
pat
# re.compile('\n\\d+\n-\n\\d+', re.VERBOSE) - note newlines in the regex
pat.match('24-34')
# <re.Match object; span=(0, 5), match='24-34'> - but it still matches fine
pat = re.compile(r'''\
\d+\
-\
\d+''', re.VERBOSE)
pat
# re.compile('\\\n\\d+\\\n-\\\n\\d+', re.VERBOSE)
pat.match('24-34')
# nothing
pat.match('\n24\n-\n34')
# <re.Match object; span=(0, 8), match='\n24\n-\n34'> - newlines required to be matched
Your other problem is that your regex is attempting to match whitespace in this capture group:
( - )
To match whitespace when you have the re.VERBOSE flag set, you must follow the rules and escape it or put it in a character class. For example:
pat = re.compile(r'( - )', re.VERBOSE)
pat.match(' - ')
# nothing - the spaces in the regex are ignored
pat.match('-')
# <re.Match object; span=(0, 1), match='-'> - matches just the `-`
pat = re.compile(r'(\ -[ ])', re.VERBOSE) # important whitespace treated appropriately
pat.match(' - ')
# <re.Match object; span=(0, 3), match=' - '> - matches the string because whitespace rules followed
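Putting both fixes together, the pattern from the question could be laid out over several lines roughly as follows. This is only a sketch: the sample log line is made up for illustration, and the inline comments are one reading of what each part is meant to match.

```python
import re

# No trailing backslashes; the spaces around "-" are escaped so that
# re.VERBOSE does not discard them.
pat = re.compile(r'''
    (?P<host>(\d{1,3}\.){3}\d{1,3})  # dotted-quad host
    (\ -\ )                          # literal space, dash, space
    (?P<user_name>(\w+|-))           # a user name or a lone dash
    .                                # any one character
''', re.VERBOSE)

# Hypothetical sample line, not taken from the question
m = pat.match('127.0.0.1 - frank [10/Oct/2000]')
print(m.group('host'), m.group('user_name'))
```

re.MULTILINE is left out because the pattern has no ^ or $ anchors, so the flag would have no effect.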
Demo on regex101
I have a long .txt file. I want to find all the matching results with regex.
For example:
test_str = 'ali. veli. ahmet.'
src = re.finditer(r'(\w+\.\s){1,2}', test_str, re.MULTILINE)
print(*src)
This code returns:
<re.Match object; span=(0, 11), match='ali. veli. '>
I need:
['ali. veli', 'veli. ahmet.']
How can I do that with regex?
The (\w+\.\s){1,2} pattern contains a repeated capturing group. Python's re does not store every capture it finds; it only keeps the last one in the group's memory buffer. In any case, you do not need the repeated capturing group here, because you want to extract multiple occurrences of the pattern from a string, and re.finditer or re.findall will do that for you.
Also, the re.MULTILINE flag is not necessary here since there are no ^ or $ anchors in the pattern.
You may get the expected results using
import re
test_str = 'ali. veli. ahmet.'
src = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(src)
# => ['ali. veli', 'veli. ahmet']
See the Python demo
The pattern means
(?= - start of a positive lookahead
\b - a word boundary (crucial here, it is necessary to only start capturing at word boundaries)
(\w+\.\s+\w+) - Capturing group 1: 1+ word chars, ., 1+ whitespaces and 1+ word chars
) - end of the lookahead.
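The two points above (a repeated group keeps only its last capture, and a lookahead allows overlapping results) can be checked side by side. A minimal sketch, where the variable names consuming and overlapping are just illustrative:

```python
import re

test_str = 'ali. veli. ahmet.'

# A repeated group only remembers its last iteration.
m = re.search(r'(\w+\.\s){1,2}', test_str)
print(repr(m.group(0)))  # the whole match
print(repr(m.group(1)))  # the last repetition only

# Without the lookahead, matches cannot overlap: "veli" is consumed
# by the first match and is not available for a second one.
consuming = re.findall(r'\b\w+\.\s+\w+', test_str)

# Inside a lookahead nothing is consumed, so the next attempt can
# start again at "veli".
overlapping = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)

print(consuming)
print(overlapping)
```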
In Python, I want to extract, from each string in a list, the part that comes before a space followed by a number.
# INPUT
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
# EXPECTED OUTPUT
output = ['bits', 'scrap', 'bits and pieces', 'junk']
I managed to do this using re.sub or re.split:
output = [re.sub(" [0-9].*", "", t) for t in text]
# OR
output = [re.split(' \d',t)[0] for t in text]
When I tried to use re.search and re.findall, they returned an empty list or an empty result.
[re.search('(.*) \d', t) for t in text]
#[None, <_sre.SRE_Match object; span=(0, 7), match='scrap 1'>, None, <_sre.SRE_Match object; span=(0, 6), match='junk 3'>]
[re.findall('(.*?) \d', t) for t in text]
#[[], ['scrap'], [], ['junk']]
Can anyone help me with the regex that can return expected output for re.search and re.findall?
You may remove the digit-and-dot substrings at the end of the string only with
import re
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
print([re.sub(r'\s+\d+(?:\.\d+)*$', '', x) for x in text])
# => output = ['bits', 'scrap', 'bits and pieces', 'junk']
See the Python demo
The pattern is
\s+ - 1+ whitespaces (note: if those digits can be "glued" to some other text, replace + (one or more occurrences) with * quantifier (zero or more occurrences))
\d+ - 1 or more digits
(?:\.\d+)* - 0 or more sequences of
\. - a dot
\d+ - 1 or more digits
$ - end of string.
See the regex demo.
To do the same with re.findall, you can use
# To get 'abc 5.6 def' (not 'abc') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d[\d.]*)?$', x)
# To get 'abc' (not 'abc 5.6 def') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d.*)?$', x)
See this regex demo.
However, this regex is not efficient enough due to the .*? construct. Here,
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars (use re.DOTALL to match all) as few as possible (so that the next optional group could be tested at each position)
(?: \d[\d.]*)? - an optional non-capturing group matching
(a literal space) - a space
\d - a digit
[\d.]* - zero or more digits or . chars
(OR) .* - any 0+ chars other than line break chars, as many as possible
$ - end of string.
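Applied to the list from the question, the second variant gives the expected output with both re.search and re.findall. A small sketch (via_search and via_findall are illustrative names):

```python
import re

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
pattern = re.compile(r'^(.*?)(?: \d.*)?$')

# search() always succeeds here because the numeric tail is optional,
# so group(1) is safe to read for every item.
via_search = [pattern.search(t).group(1) for t in text]

# findall() returns a one-element list of Group 1 values per string.
via_findall = [pattern.findall(t)[0] for t in text]

print(via_search)
print(via_findall)
```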
I am writing a Python script to find a tag name in a string like this:
string='Tag Name =LIC100 State =TRUE'
If I use an expression like this
re.search('Name(.*)State',string)
I get " =LIC100". I would like to get just LIC100.
Any suggestions on how to set up the pattern to eliminate the whitespace and the equals sign?
That is because you get 0+ chars other than line break chars from Name up to the last State. You may restrict the pattern in Group 1 to just non-whitespaces:
import re
string='Tag Name =LIC100 State =TRUE'
m = re.search(r'Name\s*=(\S*)',string)
if m:
    print(m.group(1))
See the Python demo
Pattern details:
Name - a literal char sequence
\s* - 0+ whitespaces
= - a literal =
(\S*) - Group 1 capturing 0+ chars other than whitespace (or \S+ can be used to match 1 or more chars other than whitespace).
The easiest solution would probably just be to strip it out after the fact, like so:
s = " =LIC100 "
s = s.strip('= ')
print(s)
#LIC100
If you insist on doing it within the regex, you can try something like:
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'
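For completeness, this is one way that pattern could be applied to the string from the question. A minimal sketch; the None fallback is an assumption about how a non-match should be handled:

```python
import re

s = 'Tag Name =LIC100 State =TRUE'
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'

m = re.search(reg, s)
# Fall back to None when the pattern is not found (assumed behaviour)
tag = m.group(1) if m else None
print(tag)
```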
Your current regex is failing because (.*) captures all characters until the occurrence of State. Instead of capturing everything, you can use a positive lookbehind to describe what precedes, but is not included in, the content you actually want to capture. In this case, "Name =" precedes the match, so we can put it in the lookbehind assertion as (?<=Name =), then match the word characters that follow, which stops at the next whitespace:
>>> import re
>>> s = 'Tag Name =LIC100 State =TRUE'
>>> r = re.compile(r"(?<=Name =)\w*")
>>> print(r.search(s))
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
>>> print(r.search(s).group(0))
LIC100
Following the tips above, I managed to find a nice solution.
Actually, the string I am trying to process has some non-printable characters. It is like this
"Tag Name\x00=LIC100\x00\tState=TRUE"
Using the concept of lookahead and lookbehind I found the following solution:
import re
s = 'Tag Name\x00=LIC100\x00\tState=TRUE'
T=re.search(r'(?<=Name\x00=)(.*)(?=\x00\tState)',s)
print(T.group(0))
The nice thing about this is that the result does not contain any non-printable characters.
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
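A plain capturing group can give the same result without lookarounds. The sketch below is an alternative under the assumption that the value never contains a NUL character itself; with_lookaround and with_group are illustrative names.

```python
import re

s = 'Tag Name\x00=LIC100\x00\tState=TRUE'

# Lookaround version from the post above
with_lookaround = re.search(r'(?<=Name\x00=)(.*)(?=\x00\tState)', s).group(0)

# Alternative: capture everything up to the next NUL character
# (assumes the value itself never contains \x00)
with_group = re.search(r'Name\x00=([^\x00]*)', s).group(1)

print(with_lookaround, with_group)
```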
Having this multiline variable:
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
The structure is always TAG = CONTENT, both strings are NOT fixed and CONTENT could contain new lines.
I need a regex to get:
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\n'), ('PARALLEL', '4')]
Tried multiple combinations but I'm not able to stop the regex engine at the right point for TABLES tag as its content is a multiline string delimited by the next tag.
Some attempts from the interpreter:
>>> re.findall(r'(\w+?)\s=\s(.+?)', raw, re.DOTALL)
[('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]
>>> re.findall(r'^(\w+)\s=\s(.+)?', raw, re.M)
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1'), ('PARALLEL', '4')]
>>> re.findall(r'(\w+)\s=\s(.+)?', raw, re.DOTALL)
[('CONTENT', 'ALL\nTABLES = TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\nPARALLEL = 4\n')]
Thanks!
You can use a positive lookahead to make sure you lazily match the value correctly:
(\w+)\s=\s(.+?)(?=$|\n[A-Z])
               ^^^^^^^^^^^^^
To be used with a DOTALL modifier so that a . could match a newline symbol. The (?=$|\n[A-Z]) lookahead will require .+? to match up to the end of string, or up to the newline followed with an uppercase letter.
See the regex demo.
An alternative, faster regex (an unrolled version of the expression above) follows, but the DOTALL modifier should NOT be used with it:
(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)
See another regex demo
Explanation:
(\w+) - Group 1 capturing 1+ word chars
\s*=\s* - a = symbol wrapped with optional (0+) whitespaces
(.*(?:\n(?![A-Z]).*)*) - Group 2 capturing:
.* - any 0+ characters other than a newline
(?:\n(?![A-Z]).*)* - 0+ sequences of:
\n(?![A-Z]) - a newline symbol not followed with an uppercase ASCII letter
.* - any 0+ characters other than a newline
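The unrolled pattern can be tried in Python as well. This is a sketch with a shortened sample input; note that for the last tag the pattern also picks up the trailing newline (nothing uppercase follows it), so the values are stripped here, which is an extra step not present in the original expression.

```python
import re

# Unrolled version: no re.DOTALL, so "." never crosses a newline by itself
p = re.compile(r'(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)')

# Shortened sample in the same TAG = CONTENT layout as the question
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
    , TEST.RAW_2
PARALLEL = 4
'''

# strip() removes the trailing newline captured for the final tag
pairs = [(tag, value.strip()) for tag, value in p.findall(raw)]
print(pairs)
```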
Python demo:
import re
p = re.compile(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', re.DOTALL)
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
print(p.findall(raw))
Why does this string match the pattern?
pattern = """
^Page \d of \d$|
^Group \d Notes$|
^More word lists and tips at http://wwwmajortests.com/word-lists$|
"""
re.match(pattern, "stackoverflow", re.VERBOSE)
As I understand it, it should only match strings like "Page 1 of 1" or "Group 1 Notes".
In your regular expression, there's a trailing |:
# ^More word lists and tips at http://wwwmajortests.com/word-lists$|
#                                                                  ^
An empty pattern matches any string:
>>> import re
>>> re.match('abc|', 'abc')
<_sre.SRE_Match object at 0x7fc63f3ff3d8>
>>> re.match('abc|', 'bbbb')
<_sre.SRE_Match object at 0x7fc63f3ff440>
So, remove the trailing |.
BTW, you don't need ^ because re.match checks for a match only at the beginning of the string.
Also, I recommend using raw strings (r'....') so that backslashes are escaped correctly.
ADDITIONAL NOTE
\d matches only a single digit. Use \d+ if you also want to match multiple digits.
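One more thing to watch for: with re.VERBOSE, unescaped spaces in the pattern are ignored, so "Page \d of \d" would not match "Page 1 of 1" even after the trailing | is removed. A sketch of a corrected pattern, with the URL alternative left out for brevity and the test strings chosen only for illustration:

```python
import re

# Trailing "|" removed, spaces escaped for re.VERBOSE,
# and \d+ used so multi-digit numbers also match.
pattern = r"""
    Page\ \d+\ of\ \d+$ |
    Group\ \d+\ Notes$
"""

candidates = ['Page 1 of 12', 'Group 3 Notes', 'stackoverflow']
results = [bool(re.match(pattern, c, re.VERBOSE)) for c in candidates]
print(results)

# An empty alternative matches the empty string at position 0 of any input
empty_alt = re.match('abc|', 'bbbb')
print(empty_alt.span())
```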