regex pattern not matching continuous groups - python

I am trying the following pattern :
[,;\" ](.+?\/.+?)[\",; ]
in the following string:
['"text/html,application/xhtml+xml,application/xml;q=0.9;q
=0.8"']
It matches text/html and application/xml but not application/xhtml+xml. Why?
I want to extract text/html, application/xhtml+xml and application/xml. It is extracting the 1st and 3rd but not the middle one.

Your last [,"; ] consumes the , after text/html and thus, at the next iteration, when the regex engine searches for a match, the first [,;" ] cannot match that comma. Hence, you lose one match.
You may turn the trailing [,"; ] into a non-consuming pattern, a positive lookahead, or better, since the matches cannot contain the delimiters, use a negated character class approach:
[,;" ]([^/,;" ]+/[^/,;" ]+)
If there can be more than one / inside the expected matches, remove the / char from the second character class.
Details
[,;" ] - a comma, ;, ", or space
([^/,;" ]+/[^/,;" ]+) - Group 1: one or more chars other than /, ,, ;, " and space, then a /, and then again one or more chars other than /, ,, ;, " and space, as many as possible
Python demo:
import re
rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""
res = re.findall(rx, s)
print(res) # => ['text/html', 'application/xhtml+xml', 'application/xml']
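To make the consumed-delimiter explanation concrete, the original pattern can be run next to the lookahead variant mentioned above. This is only a minimal sketch; the variable names are illustrative and not from the original post.

```python
import re

s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

# Original pattern: the trailing class consumes the delimiter,
# so that delimiter cannot start the next match.
consuming = r'[,;" ](.+?/.+?)[",; ]'

# Same pattern with the trailing class wrapped in a positive lookahead:
# the delimiter is checked but left in place for the next match.
lookahead = r'[,;" ](.+?/.+?)(?=[",; ])'

res_consuming = re.findall(consuming, s)
res_lookahead = re.findall(lookahead, s)

print(res_consuming)  # => ['text/html', 'application/xml']
print(res_lookahead)  # => ['text/html', 'application/xhtml+xml', 'application/xml']
```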

Related

How to split a string with parentheses and spaces into a list

I want to split strings like:
(so) what (are you trying to say)
what (do you mean)
Into lists like:
[(so), what, (are you trying to say)]
[what, (do you mean)]
The code that I tried is below. On the site regexr, the regular expression matches the parts that I want but gives a warning. I'm not an expert in regex, so I don't know what I'm doing wrong.
import re
string = "(so) what (are you trying to say)?"
rx = re.compile(r"((\([\w \w]*\)|[\w]*))")
print(re.split(rx, string ))
Using [\w \w]* is the same as [\w ]* and also matches an empty string.
Instead of using split, you can use re.findall without any capture groups and write the pattern like:
\(\w+(?:[^\S\n]+\w+)*\)|\w+
\( Match (
\w+ Match 1+ word chars
(?:[^\S\n]+\w+)* Optionally repeat matching spaces and 1+ word chars
\) Match )
| Or
\w+ Match 1+ word chars
import re
string = "(so) what (are you trying to say)? what (do you mean)"
rx = re.compile(r"\(\w+(?:[^\S\n]+\w+)*\)|\w+")
print(re.findall(rx, string))
Output
['(so)', 'what', '(are you trying to say)', 'what', '(do you mean)']
For your two examples you can write:
re.split(r'(?<=\)) +| +(?=\()', str)
This does not work, however, for string defined in the OP's code, which contains a question mark, which is contrary to the statement of the question in terms of the two examples.
The regular expression can be broken down as follows.
(?<=\)) # positive lookbehind asserts that location in the
# string is preceded by ')'
[ ]+ # match one or more spaces
| # or
[ ]+ # match one or more spaces
(?=\() # positive lookahead asserts that location in the
# string is followed by '('
In the above I've put each of two space characters in a character class merely to make it visible.
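The lookaround-based split can be tried on the two examples from the question, plus the string with the trailing question mark, to see the limitation described above. A small sketch; the variable names are illustrative.

```python
import re

# Split on spaces that directly follow ')' or directly precede '('.
pattern = r'(?<=\)) +| +(?=\()'

parts_a = re.split(pattern, "(so) what (are you trying to say)")
parts_b = re.split(pattern, "what (do you mean)")
# With the trailing '?' from the OP's code, the '?' stays attached
# to the last element instead of being dropped.
parts_c = re.split(pattern, "(so) what (are you trying to say)?")

print(parts_a)  # => ['(so)', 'what', '(are you trying to say)']
print(parts_b)  # => ['what', '(do you mean)']
print(parts_c)  # => ['(so)', 'what', '(are you trying to say)?']
```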

What does this regex pattern match?

regex = re.compile(r"\s*[-*+]\s*(.+)")
Especially this part: \s*[-*+]
I want to match this string:
[John](person)is good and [Mary](person) is good too.
But it fails.
Does the \s*[-*+] mean the following:
matches an optional space, followed by one of the characters: -, *, +
This is in Python.
Pattern \s*[-*+]\s*(.+) means:
\s* - match zero or more whitespace characters
[-*+] - match one character from the set: - or * or +
\s* - match zero or more whitespace characters
(.+) - match one or more of any character and store it in a capturing group (. means any character and parentheses denote a capturing group)
In your sentence, the pattern won't match anything because none of the characters from the set -*+ is present.
It would match, for example * (person) is good too. in
[John](person)is good and [Mary] * (person) is good too.
In order to match names and their description in brackets use \[([^\]]+)\]\(([^)]+)
Explanation:
\[ - match [ literally
([^\]]+) - match one or more characters other than ] and store them in the first capturing group
\] - match ] literally
\( - match ( literally
([^)]+) - match one or more characters other than ) and store them in the second capturing group
Demo
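Both patterns can be checked quickly in Python. The sketch below uses the sentence from the question; the bullet-style string "  * buy milk" is a made-up input, added only to show what the original pattern does match.

```python
import re

text = "[John](person)is good and [Mary](person) is good too."

bullet = re.compile(r"\s*[-*+]\s*(.+)")        # pattern from the question
pairs = re.compile(r"\[([^\]]+)\]\(([^)]+)")   # pattern suggested above

# The sentence contains none of -, * or +, so there is nothing to match.
no_match = bullet.search(text)

# A line that starts with a bullet character does match.
item = bullet.match("  * buy milk").group(1)

# findall returns one (name, description) tuple per [name](description).
found = pairs.findall(text)

print(no_match)  # => None
print(item)      # => buy milk
print(found)     # => [('John', 'person'), ('Mary', 'person')]
```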

parse string by the last occurrence of the character before another character in python

I have the following string that I need to split into smaller ones in a correct way:
s = "A=3, B=value one, value two, value three, C=NA, D=Other institution, except insurance, id=DRT_12345"
I cannot do the following, since I need to split only on the last "," before the "="
s.split(",")
My desired outcome is the following:
out = ["A=3",
"B=value one, value two, value three",
"C=NA",
"D=Other institution, except insurance",
"id=DRT_12345"]
Following the structure of your string, you can use re.findall:
import re
re.findall(r'\S+=.*?(?=, \S+=|$)', s)
['A=3',
'B=value one, value two, value three',
'C=NA',
'D=Other institution, except insurance',
'id=DRT_12345']
The pattern uses a lookahead to determine when to stop matching for the current key-value pair.
\S+ # match one or more non-whitespace characters
= # ...followed by an equal sign
.*? # match anything up to...
(?= # regex lookahead for
, # comma, followed by
[ ] # a space, followed by
\S+ # the same pattern
=
| # OR
$ # end of string
)
Split on "the last comma before the equals" can be translated to a regex like this:
import re
out = re.split(r',(?=[^,]*=)', s)
That's a comma (,), followed by (a positive lookahead - (?= .. )) any number of non-comma characters ([^,]*) and then an equals sign (=).
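One detail worth checking with the split approach: the comma is removed but the space after it is not, so every part except the first keeps a leading space. A minimal sketch of two ways to handle that; the `,\s*` variant is a small adaptation, not part of the original answer.

```python
import re

s = ("A=3, B=value one, value two, value three, C=NA, "
     "D=Other institution, except insurance, id=DRT_12345")

# Split exactly as in the answer above: parts keep their leading space.
raw = re.split(r',(?=[^,]*=)', s)

# Option 1: strip each part afterwards.
stripped = [part.strip() for part in raw]

# Option 2 (adaptation): let the delimiter swallow the whitespace too.
direct = re.split(r',\s*(?=[^,]*=)', s)

print(raw[1])    # => ' B=value one, value two, value three'
print(stripped)
print(direct)
```

Both `stripped` and `direct` give the five-element list asked for in the question.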

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
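The points above can be verified directly. The sketch below first lists which characters the original class really matches, then compares the two suggested replacements. The test strings are made up for illustration; the difference shown relies on the fact that, for str patterns in Python 3, \w (and therefore [^\W_]) follows Unicode, while [A-Za-z0-9] is ASCII-only.

```python
import re

# Which single characters does the original class accept?
probe = "-.:alnum[]qwerty 019"
class_chars = sorted(set(re.findall(r'[-.\:alnum:]', probe)))

# The two replacements suggested in the answers.
ascii_fix = r'-\s*([A-Za-z0-9]+)'
unicode_fix = r'-\s*([^\W_]+)'

plain = "12 - mystr"
accented = "12 - caf\u00e9"   # 'café', written with an escape for clarity

res_plain = (re.findall(ascii_fix, plain), re.findall(unicode_fix, plain))
res_accented = (re.findall(ascii_fix, accented), re.findall(unicode_fix, accented))

print(class_chars)   # => ['-', '.', ':', 'a', 'l', 'm', 'n', 'u']
print(res_plain)     # => (['mystr'], ['mystr'])
print(res_accented)  # => (['caf'], ['café'])
```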
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
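If you want to get all alphanumeric chars after the first '-' and skip the whitespace around them, you can do something like this. Note that Python's re only supports fixed-width lookbehinds, so the whitespace is matched with \s here.
re.match(r'.*?-\s*([a-zA-Z\d]+(?:\s+[a-zA-Z\d]+)*)', inputString)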
If you want to find each string of alphanumerics directly after each '-' then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)

python regex get value after string

I am trying to parse a comma-separated string of entries of the form keyword://pass#ip:port.
The string is comma-separated, but the password can contain any character, including a comma, so I cannot use a split operation with the comma as the delimiter.
I have tried to use a regex to get the string after "myserver://", so that later I can split the rest of the information with string operations (pass#ip:port/key1), but I could not get it working because I cannot fetch the information after that keyword.
myserver:// is a hardcoded string, and I need to get whatever follows each myserver:// as a comma-separated list (i.e. pass#ip:port/key1, pass2#ip2:port2/key2, etc.).
This is the closest I can get:
import re
my_servers="myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
result = re.search(r'myserver:\/\/(.*)[,(.*)|\s]', my_servers)
Using search, I try to find an occurrence of the "myserver://" keyword followed by any characters and ending with a comma (meaning another entry follows, as in myserver://zzz,myserver://qqq) or a space (in case of a single myserver:// element; I do not know a better end indicator than a space). However, this does not come out right. How can I do this better with regex?
You may consider the following splitting approach if you do not need to keep myserver:// in the results:
list(filter(None, re.split(r'\s*,?\s*myserver://', s)))
The \s*,?\s*myserver:// pattern matches an optional , enclosed with 0+ whitespaces and then myserver:// substring. See this regex demo. Note we need to remove empty entries to get rid of an empty leading entry as when the match is found at the string start, the empty string at the beginning will be added to the resulting list.
Alternatively, you can use the lookahead based pattern with a lazy dot matching pattern with re.findall:
rx = r"myserver://(.*?)(?=\s*,\s*myserver://|$)"
Details:
myserver:// - a literal substring
(.*?) - Capturing group 1 whose contents will be returned by re.findall matching any 0+ chars other than line break chars, as few as possible, up to the first occurrence (but excluding it)
(?=\s*,\s*myserver://|$) - either of the 2 alternatives:
\s*,\s*myserver:// - , enclosed with 0+ whitespaces and then a literal myserver:// substring
| - or
$ - end of string.
See a Python demo for both approaches:
import re
s = "myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
rx1 = r'\s*,?\s*myserver://'
res1 = list(filter(None, re.split(rx1, s)))  # list() is needed in Python 3, where filter returns an iterator
print(res1)
#or
rx2 = r"myserver://(.*?)(?=\s*,\s*myserver://|$)"
res2 = re.findall(rx2, s)
print(res2)
Both will print ['password,123#ip:port/key1', 'pass2#ip2:port2/key2'].
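Since the question mentions splitting each entry further with string operations, here is one possible sketch of that step. The helper name `parse_entry` is hypothetical, and it assumes the pass#ip:port/key layout from the question, with no '#' in the ip, port or key parts. Splitting on the last '#' means the password itself may contain '#' or ','.

```python
import re

s = "myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"


def parse_entry(entry):
    """Hypothetical helper: split 'pass#ip:port/key' into its parts.

    Assumes only the password may contain '#', so the LAST '#'
    separates the password from the address.
    """
    password, _, rest = entry.rpartition('#')
    address, _, key = rest.partition('/')
    ip, _, port = address.partition(':')
    return {'password': password, 'ip': ip, 'port': port, 'key': key}


entries = re.findall(r"myserver://(.*?)(?=\s*,\s*myserver://|$)", s)
parsed = [parse_entry(e) for e in entries]

print(parsed[0])
# => {'password': 'password,123', 'ip': 'ip', 'port': 'port', 'key': 'key1'}
print(parsed[1])
# => {'password': 'pass2', 'ip': 'ip2', 'port': 'port2', 'key': 'key2'}
```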
