Split string based on regexp without consuming characters [duplicate] - python

This question already has answers here:
Non-consuming regular expression split in Python
(2 answers)
Closed 8 years ago.
I would like to split a string like the following
text="one,two;three.four:"
into the list
textOut=["one", ",two", ";three", ".four", ":"]
I have tried with
import re
textOut = re.split(r'(?=[.:,;])', text)
But this does not split anything.

I would use re.findall here instead of re.split:
>>> from re import findall
>>> text = "one,two;three.four:"
>>> findall(r"(?:^|\W)\w*", text)
['one', ',two', ';three', '.four', ':']
Below is a breakdown of the Regex pattern used above:
(?: # The start of a non-capturing group
^|\W # The start of the string or a non-word character (symbol)
) # The end of the non-capturing group
\w* # Zero or more word characters (characters that are not symbols)
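The same breakdown can be kept next to the pattern itself with the re.VERBOSE flag, which lets a pattern span several lines and carry comments. A minimal sketch (the names PIECE and pieces are illustrative, not from the answer above):

```python
import re

# Same pattern as above, written with re.VERBOSE so each part is commented.
PIECE = re.compile(r"""
    (?:      # start of a non-capturing group
      ^|\W   # the start of the string, or one non-word character
    )        # end of the non-capturing group
    \w*      # zero or more word characters
""", re.VERBOSE)

text = "one,two;three.four:"
pieces = PIECE.findall(text)
print(pieces)
```

In verbose mode, whitespace in the pattern is ignored and # starts a comment, so the compiled pattern should behave like the one-line version.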

I don't know what else can occur in your string, but will this do the trick?
>>> s='one,two;three.four:'
>>> [x for x in re.findall(r'[.,;:]?\w*', s) if x]
['one', ',two', ';three', '.four', ':']
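For what it's worth, on Python 3.7 and later re.split also splits on patterns that can match an empty string, so the lookahead from the question should work unchanged there. A small sketch, assuming Python 3.7+ (on older versions the string comes back unsplit, as the question describes):

```python
import re

text = "one,two;three.four:"

# Zero-width lookahead: split *before* each delimiter without consuming it.
# Assumption: Python 3.7 or later, where re.split honours empty matches.
parts = re.split(r'(?=[.:,;])', text)
print(parts)
```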

Related

Split only first group of string with regex in python [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
I want to search for a specific string. Can anyone tell me why I am seeing the result below?
I checked it on an online regex site. It seems I have separated the pattern into 3 groups, and the result now prints all 3 groups. How can I return only the first group?
Also, is it possible to change the code so that "string" in lower case would also be detected?
Input strings:
DD-JSH-String43423213-3774
DE-String43423214-SDC-3721
Output:
'String43423213', 'String', '43423213','String43423214', 'String', '43423214'
Code:
matches = re.findall(r'((String)(\d+))', inp)
matches = [j for sub in matches for j in sub if j != ""]
Expected Result:
'String43423213', 'String43423214'
This is because the outer group wraps the two inner groups, so each of the three groups is returned; you have to remove the outer group. You can also add the flag re.I to ignore case:
matches = re.findall(r'(String)(\d+)', inp, flags=re.I)
print(*[''.join(x) for x in matches],sep="\n")
Try this regex-demo:
Python source:
import re
inp = """DD-JSH-String43423213-3774
DE-String43423214-SDC-3721"""
matches = re.findall(r'String\d+', inp, flags=re.I)
print(matches)
or
matches = re.findall(r'(?i)String\d+', inp)
print(matches)
Output:
['String43423213', 'String43423214']
explanation:
Because your regex has three capturing groups, (String)(\d+) wrapped in an outer group, re.findall returns a list of tuples holding all three groups, like ('String43423214', 'String', '43423214'). You could keep a single group, (String\d+), or use no group at all, String\d+; both expressions return only the full match.
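To make the group behaviour concrete, here is a small sketch comparing what re.findall returns with three capturing groups, with no groups, and with a non-capturing group. The variable names are illustrative only; re.finditer is shown as one more way to get the full match while keeping the groups.

```python
import re

inp = "DD-JSH-String43423213-3774\nDE-String43423214-SDC-3721"

# Three capturing groups: findall returns one 3-tuple per match.
three_groups = re.findall(r'((String)(\d+))', inp)

# No capturing groups: findall returns the whole match as a string.
no_groups = re.findall(r'String\d+', inp, flags=re.I)

# Non-capturing group (?:...): grouping without changing findall's output.
non_capturing = re.findall(r'(?:String)(?:\d+)', inp, flags=re.I)

# finditer keeps the groups but lets you pick the full match via group(0).
via_finditer = [m.group(0) for m in re.finditer(r'(string)(\d+)', inp, flags=re.I)]

print(three_groups)
print(no_groups, non_capturing, via_finditer)
```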
You can do this:
import re
inp = """
DD-JSH-String43423213-3774
DE-String43423214-SDC-3721
"""
matches = re.findall(r'String\d+', inp)
for match in matches:
    print(match)

Need to find '$word;' pattern in string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I have a big text file and I have to find all words that start with '$' and end with ';', like $word;.
import re
text = "$h;BREWERY$h_end;You've built yourself a brewery."
x = re.findall("$..;", text)
print(x)
I want my output to be ['$h;', '$h_end;']. How can I do that?
I would do:
import re
text = "$h;BREWERY$h_end;You've built yourself a brewery."
result = re.findall(r'\$[^;]+;', text)
print(result)
Output:
['$h;', '$h_end;']
Note that $ needs to be escaped (\$), as it is one of the special characters (unescaped, it anchors the match to the end of the string). Then I match 1 or more occurrences of anything but ;, and finally ;.
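A short sketch of why the original attempt finds nothing, and of letting re.escape produce the escaped literal instead of typing the backslash by hand. The variable names are illustrative.

```python
import re

text = "$h;BREWERY$h_end;You've built yourself a brewery."

# Unescaped, $ is the end-of-string anchor, so "..;" can never follow it.
unescaped = re.findall(r"$..;", text)

# re.escape("$") yields the escaped literal; [^;]+; is the rest of the
# pattern from the answer above.
escaped = re.findall(re.escape("$") + r"[^;]+;", text)

print(unescaped)
print(escaped)
```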
You may use
\$\w+;
See the regex demo. Details:
\$ - a $ char
\w+ - 1+ letters, digits, _ (=word) chars
; - a semi-colon.
Python demo:
import re
text = "$h;BREWERY$h_end;You've built yourself a brewery."
x = re.findall(r"\$\w+;", text)
print(x) # => ['$h;', '$h_end;']
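The two suggested patterns agree on the sample text but can differ on other input: [^;]+ accepts anything up to the next semicolon, while \w+ only accepts word characters. A quick comparison on a made-up string (sample is a hypothetical input, not from the question):

```python
import re

# Hypothetical input containing a "$" that does not start a $word; token.
sample = "cost: $5 or more; name: $name;"

loose = re.findall(r"\$[^;]+;", sample)   # anything except ';' after the $
strict = re.findall(r"\$\w+;", sample)    # only letters, digits, underscore

print(loose)
print(strict)
```

Which one fits depends on whether the tokens in the file can contain spaces or punctuation.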

Python re can't split zero-width anchors? [duplicate]

This question already has answers here:
Python regex: splitting on pattern match that is an empty string
(2 answers)
Closed 5 years ago.
import re
s = 'PythonCookbookListOfContents'
# the first line does not work
print(re.split('(?<=[a-z])(?=[A-Z])', s))
# the second line works well
print(re.sub('(?<=[a-z])(?=[A-Z])', ' ', s))
# it should be ['Python', 'Cookbook', 'List', 'Of', 'Contents']
How can I split a string at the boundary between a lowercase character and an uppercase character using Python re?
Why does the first line fail to work while the second line works well?
According to the re.split documentation (as it read for Python versions before 3.7):
Note that split will never split a string on an empty pattern match.
For example:
>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']
How about using re.findall instead? (Instead of focusing on separators, focus on the item you want to get.)
>>> import re
>>> s = 'PythonCookbookListOfContents'
>>> re.findall('[A-Z][a-z]+', s)
['Python', 'Cookbook', 'List', 'Of', 'Contents']
UPDATE
Using regex module (Alternative regular expression module, to replace re), you can split on zero-width match:
>>> import regex
>>> s = 'PythonCookbookListOfContents'
>>> regex.split('(?<=[a-z])(?=[A-Z])', s, flags=regex.VERSION1)
['Python', 'Cookbook', 'List', 'Of', 'Contents']
NOTE: Specify regex.VERSION1 flag to enable split-on-zero-length-match behavior.
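With the standard library alone there are two more options worth sketching. On Python 3.7 and later, re.split accepts empty matches, so the original pattern should split directly; on any version, the re.sub call from the question can mark the boundaries first. The sketch also shows a case where the findall pattern above drops text. The name BOUNDARY and the sample 'ParseHTTPResponse' are illustrative assumptions.

```python
import re

s = 'PythonCookbookListOfContents'
BOUNDARY = r'(?<=[a-z])(?=[A-Z])'  # zero-width: between lower and upper case

# Assumption: Python 3.7+, where re.split splits on empty matches.
direct = re.split(BOUNDARY, s)

# Version-independent: insert a separator with re.sub, then split on it.
via_sub = re.sub(BOUNDARY, ' ', s).split(' ')

# Caveat for the findall approach: runs of capitals are silently dropped.
dropped = re.findall('[A-Z][a-z]+', 'ParseHTTPResponse')

print(direct)
print(via_sub)
print(dropped)
```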

Regex can't escape question mark? [duplicate]

This question already has an answer here:
match trailing slash with Python regex
(1 answer)
Closed 8 years ago.
I can't match the question mark character although I escaped it.
I tried escaping with multiple backslashes and also using re.escape().
What am I missing?
Code:
import re
text = 'test?'
result = ''
result = re.match(r'\?',text)
print ("input: "+text)
print ("found: "+str(result))
Output:
input: test?
found: None
re.match only matches a pattern at the beginning of the string; as the docs say:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
so, either:
>>> re.match(r'.*\?', text).group(0)
'test?'
or re.search
>>> re.search(r'\?', text).group(0)
'?'
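Side by side, the three matching functions behave differently on the question's input. A minimal sketch (variable names are illustrative); it also shows that re.escape produces the same escaped pattern the question already used, so the escaping was never the problem.

```python
import re

text = 'test?'

at_start = re.match(r'\?', text)       # anchored at index 0, where 't' is
anywhere = re.search(r'\?', text)      # scans the string for the first hit
whole = re.fullmatch(r'\w+\?', text)   # the entire string must match

print(at_start)                        # None
print(anywhere.group(0), anywhere.start())
print(whole.group(0))
print(re.escape('?'))
```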

Python: Split a string, respect and preserve quotes [duplicate]

This question already has answers here:
Split a string by spaces -- preserving quoted substrings -- in Python
(16 answers)
Closed 7 years ago.
Using python, I want to split the following string:
a=foo, b=bar, c="foo, bar", d=false, e="false"
This should result in the following list:
['a=foo', 'b=bar', 'c="foo, bar"', 'd=false', 'e="false"']
When using shlex in posix-mode and splitting with ", ", the argument for c gets treated correctly. However, it removes the quotes. I need them because false is not the same as "false", for instance.
My code so far:
import shlex
mystring = 'a=foo, b=bar, c="foo, bar", d=false, e="false"'
splitter = shlex.shlex(mystring, posix=True)
splitter.whitespace += ','
splitter.whitespace_split = True
print(list(splitter)) # ['a=foo', 'b=bar', 'c=foo, bar', 'd=false', 'e=false']
>>> s = r'a=foo, b=bar, c="foo, bar", d=false, e="false", f="foo\", bar"'
>>> re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', s)
['a=foo', 'b=bar', 'c="foo, bar"', 'd=false', 'e="false"', 'f="foo\\", bar"']
The regex pattern "[^"]*" matches a simple quoted string.
"(?:\\.|[^"])*" matches a quoted string and skips over escaped quotes because \\. consumes two characters: a backslash and any character.
[^\s,"] matches a non-delimiter.
Combining patterns 2 and 3 inside (?: | )+ matches a sequence of non-delimiters and quoted strings, which is the desired result.
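The combined pattern can be wrapped in a small helper and written with re.VERBOSE so the two alternatives stay readable. This is only a sketch of the approach described above; the names TOKEN and split_keep_quotes are made up for illustration.

```python
import re

# The pattern from the answer, spread out with re.VERBOSE.
TOKEN = re.compile(r'''
    (?:
        [^\s,"]             # one non-delimiter: not whitespace, comma, or quote
      | "(?:\\.|[^"])*"     # a quoted string; \\. skips escaped characters
    )+                      # any run of non-delimiters and quoted strings
''', re.VERBOSE)

def split_keep_quotes(s):
    """Split on commas/whitespace outside quotes, keeping the quotes."""
    return TOKEN.findall(s)

mystring = 'a=foo, b=bar, c="foo, bar", d=false, e="false"'
fields = split_keep_quotes(mystring)
print(fields)
```

Because the quotes are part of each match, false and "false" stay distinguishable in the result.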
