Python: Split a string, respect and preserve quotes [duplicate] - python

This question already has answers here:
Split a string by spaces -- preserving quoted substrings -- in Python
(16 answers)
Closed 7 years ago.
Using python, I want to split the following string:
a=foo, b=bar, c="foo, bar", d=false, e="false"
This should result in the following list:
['a=foo', 'b=bar', 'c="foo, bar"', 'd=false', 'e="false'"']
When using shlex in posix-mode and splitting with ", ", the argument for cgets treated correctly. However, it removes the quotes. I need them because false is not the same as "false", for instance.
My code so far:
import shlex
mystring = 'a=foo, b=bar, c="foo, bar", d=false, e="false"'
splitter = shlex.shlex(mystring, posix=True)
splitter.whitespace += ','
splitter.whitespace_split = True
print list(splitter) # ['a=foo', 'b=bar', 'c=foo, bar', 'd=false', 'e=false']

>>> s = r'a=foo, b=bar, c="foo, bar", d=false, e="false", f="foo\", bar"'
>>> re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', s)
['a=foo', 'b=bar', 'c="foo, bar"', 'd=false', 'e="false"', 'f="foo\\", bar"']
The regex pattern "[^"]*" matches a simple quoted string.
"(?:\\.|[^"])*" matches a quoted string and skips over escaped quotes because \\. consumes two characters: a backslash and any character.
[^\s,"] matches a non-delimiter.
Combining patterns 2 and 3 inside (?: | )+ matches a sequence of non-delimiters and quoted strings, which is the desired result.

Related

Seek clarity on regex with $ and \Z [duplicate]

This question already has answers here:
Dollar sign in regular expression and new line character
(2 answers)
Closed 5 months ago.
There is wee confusion between $ and \Z in regex. I understand the underlying concept,
\Z matches the end of the string regardless of the multiline mode where as
$ matches end of the string or just before "\n" in multiline mode.
import re
items = ['lovely', '1\dentist', '2 lonely', 'eden', 'fly\n', 'dent']
# res = [e for e in items if re.search(r'\Aden|ly\Z', e)]
t = re.compile(r"^den|ly$")
res = [e for e in items if t.search(e)]
print(res)
res = ['lovely', '2 lonely', 'fly\n', 'dent']
Why am I matching "fly\n", It ends with "\n" so isn't it suppose to ignore it where as r"^den|ly\Z" get me the desired result.
Note the python documentation for the $ special character:
Matches the end of the string or just before the newline at the end of the string [emphasis added] […]
In "fly\n", the newline is at the end of the string, so '$' can match just before it. If instead the string were "fly\n\n", then the regex would fail.

Need to find '$word;' pattern in string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I have big text file and I have to find all words starts with '$' and ends with ';' like $word;.
import re
text = "$h;BREWERY$h_end;You've built yourself a brewery."
x = re.findall("$..;", text)
print(x)
I want my output like ['$h;', '$h_end;'] How can I do that?
I have to find all words starts with '$' and ends with ';' like $word;.
I would do:
import re
text = "$h;BREWERY$h_end;You've built yourself a brewery."
result = re.findall('\$[^;]+;',text)
print(result)
Output:
['$h;', '$h_end;']
Note that $ needs to be escaped (\$) as it is one of special characters. Then I match 1 or more occurences of anything but ; and finally ;.
You may use
\$\w+;
See the regex demo. Details:
\$ - a $ char
\w+ - 1+ letters, digits, _ (=word) chars
; - a semi-colon.
Python demo:
import re
text = "$h;BREWERY$h_end;You've built yourself a brewery."
x = re.findall(r"\$\w+;", text)
print(x) # => ['$h;', '$h_end;']

Remove chars from string using Regular Expression [duplicate]

This question already has answers here:
Remove specific characters from a string in Python
(26 answers)
Closed 5 years ago.
Given an array of strings which contains alphanumeric characters but also punctuations that have to be deleted. For instance the string x="0-001" is converted into x="0001".
For this purpose I have:
punctuations = list(string.punctuation)
Which contain all the characters that have to be removed from the strings. I'm trying to solve this using regular expressions in python, any suggestion on how to proceed using regular expressions?
import string
punctuations = list(string.punctuation)
test = "0000.1111"
for i, char in enumerate(test):
if char in punctuations:
test = test[:i] + test[i+ 1:]
If all you want to do is remove non-alphanumeric characters from a string, you can do it simply with re.sub:
>>> re.sub('\W', '', '0-001')
'0001'
Note, the \W will match any character which is not a Unicode word character. This is the opposite of \w. For ASCII strings it's equivalent to [^a-zA-Z0-9_].

Regex can't escape question mark? [duplicate]

This question already has an answer here:
match trailing slash with Python regex
(1 answer)
Closed 8 years ago.
I can't match the question mark character although I escaped it.
I tried escaping with multiple backslashes and also using re.escape().
What am I missing?
Code:
import re
text = 'test?'
result = ''
result = re.match(r'\?',text)
print ("input: "+text)
print ("found: "+str(result))
Output:
input: test?
found: None
re.match only matches a pattern at the begining of string; as in the docs:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
so, either:
>>> re.match(r'.*\?', text).group(0)
'test?
or re.search
>>> re.search(r'\?', text).group(0)
'?'

Split string based on regexp without consuming characters [duplicate]

This question already has answers here:
Non-consuming regular expression split in Python
(2 answers)
Closed 8 years ago.
I would like to split a string like the following
text="one,two;three.four:"
into the list
textOut=["one", ",two", ";three", ".four", ":"]
I have tried with
import re
textOut = re.split(r'(?=[.:,;])', text)
But this does not split anything.
I would use re.findall here instead of re.split:
>>> from re import findall
>>> text = "one,two;three.four:"
>>> findall("(?:^|\W)\w*", text)
['one', ',two', ';three', '.four', ':']
>>>
Below is a breakdown of the Regex pattern used above:
(?: # The start of a non-capturing group
^|\W # The start of the string or a non-word character (symbol)
) # The end of the non-capturing group
\w* # Zero or more word characters (characters that are not symbols)
For more information, see here.
I don't know what else can occur in your string, but will this do the trick?
>>> s='one,two;three.four:'
>>> [x for x in re.findall(r'[.,;:]?\w*', s) if x]
['one', ',two', ';three', '.four', ':']

Categories

Resources