What does this regex pattern match? - python

regex = re.compile(r"\s*[-*+]\s*(.+)")
Especially this part: \s*[-*+]
I want to match this string:
[John](person)is good and [Mary](person) is good too.
But it fails.
Does the \s*[-*+] mean the following:
matches an optional space, followed by one of the characters: -, *, +
This is in Python.

Pattern \s*[-*+]\s*(.+) means:
\s* - match zero or more whitespace characters
[-*+] - match one character from the set: - or * or +
(.+) - match one or more of any characters and store them in a capturing group (. means any character except a line break, and parentheses denote a capturing group)
In your sentence, the pattern won't match anything because none of the characters -, * or + occurs in it.
It would match, for example * (person) is good too. in
[John](person)is good and [Mary] * (person) is good too.
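A minimal sketch to check this behavior, assuming re.search is used (re.match would only try at the start of the string); the variable names are illustrative:

```python
import re

pattern = re.compile(r"\s*[-*+]\s*(.+)")

original = "[John](person)is good and [Mary](person) is good too."
modified = "[John](person)is good and [Mary] * (person) is good too."

# The original sentence contains no -, * or +, so search finds nothing.
no_match = pattern.search(original)

# The modified sentence contains " * ", so the rest of the line is captured.
m = pattern.search(modified)
captured = m.group(1) if m else None

print(no_match)  # None
print(captured)  # (person) is good too.
```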
In order to match names and their description in brackets use \[([^\]]+)\]\(([^)]+)
Explanation:
\[ - match [ literally
([^\]]+) - match one or more characters other than ] and store them in the first capturing group
\] - match ] literally
\( - match ( literally
([^)]+) - match one or more characters other than ) and store them in the second capturing group
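As a quick check, the pattern can be applied with re.findall, which returns one tuple per match when the pattern has two groups. This is a small sketch on the sentence from the question:

```python
import re

text = "[John](person)is good and [Mary](person) is good too."

# Each tuple holds (name, description) from one [name](description) pair.
pairs = re.findall(r"\[([^\]]+)\]\(([^)]+)", text)

print(pairs)  # [('John', 'person'), ('Mary', 'person')]
```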

Related

Get words in parenthesis as a group regex

String1: {{word1|word2|word3 (word4 word5)|word6}}
String2: {{word1|word2|word3|word6}}
With this regex sentence:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?=\}\})
I capture String2 as groups. How can I change the regex sentence to capture (word4 word5) also as a group?
You can add a (?:\s*(\([^()]*\)))? subpattern:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
See the regex demo.
The (?:\s*(\([^()]*\)))? part is an optional non-capturing group that matches one or zero occurrences of
\s* - zero or more whitespaces
( - start of a capturing group:
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
) - end of the group.
If you need to make sure only whitespace separated words are allowed inside parentheses, replace [^()]* with \w+(?:\s+\w+)* and insert (?:\s*(\(\w+(?:\s+\w+)*\)))?:
(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)(?:\s*(\(\w+(?:\s+\w+)*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})
See this regex demo.
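A short sketch of how the first variant behaves in Python on both sample strings. When the parenthesized part is absent, the optional group did not participate in the match, so its value is None:

```python
import re

pattern = re.compile(
    r"(?<=\{\{)(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)\|(\w+(?:\s+\w+)*)"
    r"(?:\s*(\([^()]*\)))?\|(\w+(?:\s+\w+)*)(?=\}\})"
)

string1 = "{{word1|word2|word3 (word4 word5)|word6}}"
string2 = "{{word1|word2|word3|word6}}"

groups1 = pattern.search(string1).groups()
groups2 = pattern.search(string2).groups()

print(groups1)  # ('word1', 'word2', 'word3', '(word4 word5)', 'word6')
print(groups2)  # ('word1', 'word2', 'word3', None, 'word6')
```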
You could simplify the expression by matching the desired substrings rather than capturing them. For that you could use the following regular expression.
(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)
The elements of the expression are as follows.
(?<= # begin a positive lookbehind
[{| ] # match one of the indicated characters
) # end the positive lookbehind
\w+ # match one or more word characters
(?= # begin a positive lookahead
[}| ] # match one of the indicated characters
) # end positive lookahead
| # or
\( # match a literal (
[\w ]+ # match one or more of the indicated characters
\) # match a literal )
Note that this does not validate the format of the string.
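Since this expression has no capturing groups, re.findall returns the matched substrings directly. A minimal sketch on the two strings from the question:

```python
import re

pattern = re.compile(r"(?<=[{| ])\w+(?=[}| ])|\([\w ]+\)")

matches1 = pattern.findall("{{word1|word2|word3 (word4 word5)|word6}}")
matches2 = pattern.findall("{{word1|word2|word3|word6}}")

print(matches1)  # ['word1', 'word2', 'word3', '(word4 word5)', 'word6']
print(matches2)  # ['word1', 'word2', 'word3', 'word6']
```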

Regex include only one digit between chars

I have to parse a PDF document and I'm using PyPDF2 with re(regex).
The file includes several lines like the one below:
18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40
I need to extract from this line the text between the time and the amount:
PEDMILANO OVEST- BINASCOA
The following code works, but sometimes it doesn't find anything because there can be a digit among these characters, for example, 18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40.
regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
Is there a way to include a number in this regular expression?
The following should simplify the current regex:
import re
s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'
re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'
See demo
\:\d+([A-Z].*?)(?=\d+\,\d+$)
\: matches the character : literally (case sensitive)
\d+: matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([A-Z].*?)
Match a single character present in the list below [A-Z]
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\d+\,\d+$)
Assert that the Regex below matches
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
I suggest using
import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )
It can also be written as
re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)
Or, if you prefer matching and capturing:
m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
    print( m.group(1) )
See an online Python demo. With this solution, your data may start with any char, and will contain any char (excluding line break chars, since your data is on single lines).
Regex details
^ - start of string
\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, -, two digits, -, five or six digits, :, two digits, : two digits
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\d+(?:,\d+)? - an int/float value pattern: 1+ digits followed with an optional sequence of , and 1+ digits
$ - end of string.
See the regex demo.
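For parsing many lines, the same pattern can be compiled once and wrapped in a helper. This is only a sketch: the function name extract_description and the sample list are illustrative, not part of the original code.

```python
import re

# Same pattern as above, compiled once for reuse.
LINE_RE = re.compile(r"^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$")

def extract_description(line):
    """Return the text between the time and the amount, or None if no match."""
    m = LINE_RE.search(line)
    return m.group(1) if m else None

lines = [
    "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40",
    "18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40",  # digit inside the text
    "not a data line",
]
descriptions = [extract_description(line) for line in lines]

print(descriptions)
# ['PEDMILANO OVEST- BINASCOA', 'PEDMILANO OVE3ST- BINASCOA', None]
```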

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured ' mystr' (including the leading space) in the first capture group. If any groups are in the regex, findall returns a list of the found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
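To see which characters the bracket expression matches on its own, it can be run without the trailing group. A small sketch:

```python
import re

# Python treats this as a plain set of literal characters: - . : a l n u m
char_class = re.compile(r"[-.\:alnum:]")

hits1 = char_class.findall("12 - mystr")
hits2 = char_class.findall("qwertyuio")

print(hits1)  # ['-', 'm']
print(hits2)  # ['u']
```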
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not capture mystr from your string because the intervening space is, as you note, not an alphanumeric character, so the group matches an empty string right after the hyphen. To account for that, use -\s*([A-Za-z0-9]*).
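The difference between the two patterns can be checked directly. In this sketch, the pattern without \s* still matches the hyphen, but its group is empty:

```python
import re

s = "12 - mystr"

# Without \s*, the group stops at the space and captures nothing.
without_space = re.findall(r"-([A-Za-z0-9]*)", s)

# With \s*, the whitespace is skipped before the group starts.
with_space = re.findall(r"-\s*([A-Za-z0-9]*)", s)

print(without_space)  # ['']
print(with_space)     # ['mystr']
```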
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
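If you want to get all alphanumeric chars after the first '-' and skip all other characters, you can take the part of the string after the first '-' and collect the alphanumeric runs (Python's re does not accept variable-width lookbehinds such as (?<=\s+), so this avoids them):
re.findall(r'[a-zA-Z\d]+', inputString.partition('-')[2])
If you want to find each string of alphanumerics immediately after each '-' then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)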

regex pattern not matching continuous groups

I am trying the following pattern :
[,;\" ](.+?\/.+?)[\",; ]
in the following string:
['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']
It matches some of the expected substrings but not all of them. Why?
I want to extract text/html, application/xhtml+xml and application/xml. It is extracting 1st and 3rd but not the middle one
Your last [,"; ] consumes the , after text/html and thus, at the next iteration, when the regex engine searches for a match, the first [,;" ] cannot match that comma. Hence, you lose one match.
You may turn the trailing [,"; ] into a non-consuming pattern, a positive lookahead, or better, since the matches cannot contain the delimiters, use a negated character class approach:
[,;" ]([^/,;" ]+/[^/,;" ]+)
See the regex demo. If there can be more than 1 / inside the expected matches, remove / char from the second character class.
Details
[,;" ] - a comma, ;, ", or space
([^/,;" ]+/[^/,;" ]+) - Group 1: one or more chars other than /, ,, ;, " and space, then /, and then again one or more chars other than /, ,, ;, " and space, as many as possible
Python demo:
import re
rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""
res = re.findall(rx, s)
print(res) # => ['text/html', 'application/xhtml+xml', 'application/xml']
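The positive lookahead alternative mentioned above can be compared with the original pattern. This is a sketch that keeps the question's lazy .+? parts and only makes the trailing delimiter non-consuming:

```python
import re

s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

# Original: the trailing class consumes the delimiter, so the next match
# cannot start at that same delimiter.
consuming = re.findall(r'[,;" ](.+?/.+?)[",; ]', s)

# Lookahead: the trailing delimiter is only checked, not consumed.
lookahead = re.findall(r'[,;" ](.+?/.+?)(?=[",; ])', s)

print(consuming)  # ['text/html', 'application/xml']
print(lookahead)  # ['text/html', 'application/xhtml+xml', 'application/xml']
```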

Python regex matching all but last occurrence

So I have expression such as "./folder/thisisa.test/file.cxx.h" How do I substitute/remove all the "." but the last dot?
To match all but the last dot with a regex:
'\.(?=[^.]*\.)'
This uses a lookahead to check that there's another dot after the one we found (the lookahead is not part of the match).
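Applied with re.sub to the path from the question, a minimal sketch looks like this:

```python
import re

path = "./folder/thisisa.test/file.cxx.h"

# Remove every dot that still has another dot somewhere after it.
result = re.sub(r"\.(?=[^.]*\.)", "", path)

print(result)  # /folder/thisisatest/filecxx.h
```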
Without regular expressions, using str.count and str.replace:
s = "./folder/thisisa.test/file.cxx.h"
s.replace('.', '', s.count('.')-1)
# '/folder/thisisatest/filecxx.h'
Specific one-char solution
In your current scenario, you may use
text = re.sub(r'\.(?![^.]*$)', '', text)
Here, \.(?![^.]*$) matches a . (with \.) that is not immediately followed ((?!...)) with any 0+ chars other than . (see [^.]*) up to the end of the string ($).
See the regex demo and the Python demo.
Generic solution for 1+ chars
In case you want to replace a . and any more chars you may use a capturing group around a character class with the chars you need to match and add the positive lookahead with .* and a backreference to the captured value.
Say, you need to remove all but the last occurrence of each of [, ], ^, \, /, - or . you may use
([][^\\./-])(?=.*\1)
See the regex demo.
Details
([][^\\./-]) - a capturing group matching ], [, ^, \, ., /, - (note the order of these chars is important: - must be at the end, ] must be at the start, ^ should not be at the start and \ must be escaped)
(?=.*\1) - a positive lookahead that requires any 0+ chars as many as possible and then the value captured in Group 1.
Python sample code:
import re
text = r"./[\folder]/this-is-a.test/fi^le.cxx.LAST[]^\/-.h"
text = re.sub(r'([][^\\./-])(?=.*\1)', '', text, flags=re.S)
print(text)
Mind the r prefix with string literals. Note that flags=re.S makes . match line break chars as well.
