Python Regex Split Keeps Split Pattern Characters - python

Easiest way to explain this is an example:
I have this string: 'Docs/src/Scripts/temp'
Which I know how to split two different ways:
re.split('/', 'Docs/src/Scripts/temp') -> ['Docs', 'src', 'Scripts', 'temp']
re.split('(/)', 'Docs/src/Scripts/temp') -> ['Docs', '/', 'src', '/', 'Scripts', '/', 'temp']
Is there a way to split by the forward slash, but keep the slash part of the words?
For example, I want the above string to look like this:
['Docs/', '/src/', '/Scripts/', '/temp']
Any help would be appreciated!

Interesting question, I would suggest doing something like this:
>>> 'Docs/src/Scripts/temp'.replace('/', '/\x00/').split('\x00')
['Docs/', '/src/', '/Scripts/', '/temp']
The idea here is to first replace all / characters by two / characters separated by a special character that would not be a part of the original string. I used a null byte ('\x00'), but you could change this to something else, then finally split on that special character.
Regex isn't actually great here because re.split() could not split on zero-length matches before Python 3.7, and re.findall() does not find overlapping matches, so you would potentially need to do several passes over the string.
Also, re.split('/', s) will do the same thing as s.split('/'), but the second is more efficient.
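The replace-then-split trick can be wrapped in a small helper that refuses to run when the sentinel already occurs in the input. This is only a sketch: the function name, its parameters, and the sentinel check are illustrative additions, not part of the answer above.

```python
def split_keep_sep(text, sep='/', sentinel='\x00'):
    """Split text on sep, keeping sep on both neighbouring pieces.

    split_keep_sep and its parameters are hypothetical names for illustration.
    """
    if sentinel in text:
        raise ValueError('sentinel already occurs in the input')
    # Turn every 'sep' into 'sep<sentinel>sep', then cut at the sentinel.
    return text.replace(sep, sep + sentinel + sep).split(sentinel)


print(split_keep_sep('Docs/src/Scripts/temp'))
# ['Docs/', '/src/', '/Scripts/', '/temp']
```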

A solution without split() but with lookaheads:
>>> s = 'Docs/src/Scripts/temp'
>>> r = re.compile(r"(?=((?:^|/)[^/]*/?))")
>>> r.findall(s)
['Docs/', '/src/', '/Scripts/', '/temp']
Explanation:
(?= # Assert that it's possible to match...
( # and capture...
(?:^|/) # the start of the string or a slash
[^/]* # any number of non-slash characters
/? # and (optionally) an ending slash.
) # End of capturing group
) # End of lookahead
Since a lookahead assertion is tried at every position in the string and doesn't consume any characters, it doesn't have a problem with overlapping matches.
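To see the difference the lookahead makes, it may help to run a consuming pattern next to it. The consuming pattern below is an assumption chosen for comparison; it is not taken from the answer.

```python
import re

s = 'Docs/src/Scripts/temp'

# Consuming pattern: each slash can belong to only one match.
consuming = re.findall(r'/?[^/]+/?', s)

# Same idea inside a lookahead: nothing is consumed, so matches may overlap.
overlapping = re.findall(r'(?=((?:^|/)[^/]*/?))', s)

print(consuming)    # ['Docs/', 'src/', 'Scripts/', 'temp']
print(overlapping)  # ['Docs/', '/src/', '/Scripts/', '/temp']
```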

1) You do not need regular expressions to split on a single fixed character:
>>> 'Docs/src/Scripts/temp'.split('/')
['Docs', 'src', 'Scripts', 'temp']
2) Consider using this method:
import os.path
def components(path):
    # Note: os.path.sep is '/' on POSIX systems but '\\' on Windows.
    start = 0
    for end, c in enumerate(path):
        if c == os.path.sep:
            yield path[start:end+1]
            start = end
    yield path[start:]
It doesn't rely on clever tricks like split-join-splitting, which makes it much more readable, in my opinion.

If you don't insist on having slashes on both sides, it's actually quite simple:
>>> re.findall(r"([^/]*/)", 'Docs/src/Scripts/temp')
['Docs/', 'src/', 'Scripts/']
Neither re nor split are really cut out for overlapping strings, so if that's what you really want, I'd just add a slash to the start of every result except the first.

Try this:
re.split(r'(/)', 'Docs/src/Scripts/temp')
From python's documentation
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

I'm not sure there is an easy way to do this. This is the best I could come up with...
import re
lSplit = re.split('/', 'Docs/src/Scripts/temp')
print([lSplit[0]+'/'] + ['/'+x+'/' for x in lSplit][1:-1] + ['/'+lSplit[-1]])
Kind of a mess, but it does do what you wanted.
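The same "split, then put the slashes back" idea can be written so that it also behaves for a path with a single component. This is a sketch; split_with_slashes is a hypothetical name, and plain str.split is used instead of re.split because the separator is a fixed character.

```python
def split_with_slashes(path, sep='/'):
    # split_with_slashes is a hypothetical helper name for illustration.
    parts = path.split(sep)
    last = len(parts) - 1
    return [
        (sep if i > 0 else '') + part + (sep if i < last else '')
        for i, part in enumerate(parts)
    ]


print(split_with_slashes('Docs/src/Scripts/temp'))
# ['Docs/', '/src/', '/Scripts/', '/temp']
print(split_with_slashes('Docs'))
# ['Docs']
```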

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails to return a='202' when mystring=' 20220220519AX'. Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide me with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
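A quick way to see what "greedy" means here is to run the same lookahead with a lazy quantifier next to it. The lazy variant (\d+?) is an addition for comparison only and is not part of the answer above.

```python
import re

text = ' 920220220519AX 12022'

# Greedy: each match takes the longest run of digits that is still followed by 2022.
greedy = re.findall(r'\d+(?=2022)', text)

# Lazy: each match stops at the first position that is followed by 2022.
lazy = re.findall(r'\d+?(?=2022)', text)

print(greedy)  # ['9202', '1']
print(lazy)    # ['9', '202', '1']
```

Either way, findall only reports non-overlapping matches found in one left-to-right pass, so neither list contains every possible combination.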
You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; it only splits the string.
If the string consists of only word characters, splitting on \B2022, where \B means "not at a word boundary", will also prevent splitting at the start of the example string.
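The \B variant could look like the sketch below. The helper name first_part and the use of maxsplit=1 are assumptions added here; maxsplit simply stops after the first usable "2022".

```python
import re

def first_part(s):
    # first_part is a hypothetical helper name for illustration.
    # \B2022 only matches a "2022" that is preceded by another word character,
    # so a "2022" at the very start of the stripped string is skipped.
    return re.split(r'\B2022', s.strip(), maxsplit=1)[0]


for s in [' 1020220519AX', ' 20220220519AX']:
    print(first_part(s))
# 10
# 202
```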

re.match never returns None? [duplicate]

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is, that your current pattern matches empty strings. To prevent this you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^ # BOS
(?: # Cluster choice
[bc]* a [bc]* # only 1 [a] allowed, arbitrary [bc]'s
| # or,
[bc]+ # no [a]'s only [bc]'s ( so must be some )
) # End cluster
$ # EOS
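The pattern can be checked against the examples from the question with a few lines of Python. This is a sketch: is_valid is a hypothetical name, and re.fullmatch is used in place of the ^ and $ anchors, which is an equivalent choice made here rather than something the answer prescribes.

```python
import re

# Either exactly one 'a' surrounded by b/c, or one or more b/c with no 'a'.
AT_MOST_ONE_A = re.compile(r'(?:[bc]*a[bc]*|[bc]+)')

def is_valid(word):
    # is_valid is a hypothetical helper name for illustration.
    return AT_MOST_ONE_A.fullmatch(word) is not None


for word in ['a', 'abc', 'bbca', 'bbcabb', 'aa', 'bbaa', '']:
    print(repr(word), is_valid(word))
```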

regex to match character sequence and only one type of symbol

I have a regex for matching sequences of characters. i want it to only match if one type of separator (a space, "/" or "-") is used, not a combination of all.
^(([1-9]|1[0-3]|A|J|Q|K|a|j|q|k)(C|D|H|S|c|d|h|s))( |\-|\/)(([1-9]|1[0-3]|A|J|Q|K|a|j|q|k)(C|D|H|S|c|d|h|s))( |\-|\/)(([1-9]|1[0-3]|A|J|Q|K|a|j|q|k)(C|D|H|S|c|d|h|s))( |\-|\/)(([1-9]|1[0-3]|A|J|Q|K|a|j|q|k)(C|D|H|S|c|d|h|s))( |\-|\/)(([1-9]|1[0-3]|A|J|Q|K|a|j|q|k)(C|D|H|S|c|d|h|s))
for example i want it to match:
as/3d/0S/Td/13C
but not:
As/3d-QS/Ad/13C
or:
As-2d QS/Td 13C
First, you can simplify the regex a lot, which makes it more readable:
(C|D|H|S|c|d|h|s) -> [CDHScdhs]
([1-9]|1[0-3]|A|J|Q|K|a|j|q|k) -> ([1-9]|1[0-3]|[AJQKajqk])
( |\-|\/) -> [ \/-]
Then you may use a backreference to ensure that the same separator is used. The number is the index of the group to reuse; after simplification it is group 2 in my regex.
Also, as all the parts are the same, you could simplify to
^([1-9]|1[0-3]|[AJQKajqk])[CDHScdhs]([ \/-])(([1-9]|1[0-3]|[AJQKajqk])[CDHScdhs]\2?)+$
But setting the re.IGNORECASE flag, you can remove uppercase letters
^([1-9]|1[0-3]|[ajqk])[cdhs]([ \/-])(([1-9]|1[0-3]|[ajqk])[cdhs]\2?)+$
==> FINAL REGEX
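A stricter variant of the backreference idea is sketched below. It assumes exactly five cards (as in the question's original pattern) and a mandatory separator between them, whereas the pattern above makes the separator optional. CARD, HAND and same_separator are hypothetical names, and the sample hands use only ranks that the pattern accepts.

```python
import re

# One card: rank (1-13 or A/J/Q/K) followed by a suit letter.
CARD = r'(?:1[0-3]|[1-9]|[ajqk])[cdhs]'

# First card, a captured separator, three "card + same separator", last card.
HAND = re.compile(rf'{CARD}([ /-])(?:{CARD}\1){{3}}{CARD}', re.IGNORECASE)

def same_separator(hand):
    # \1 forces every later separator to equal the first one.
    return HAND.fullmatch(hand) is not None


print(same_separator('As/3d/QS/Ad/13C'))  # True
print(same_separator('As/3d-QS/Ad/13C'))  # False
```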
Replace all but the first occurrence of ( |\-|\/) with \4, so that you backreference whatever separator was matched first and expect it everywhere else.
A solution without regex seems much cleaner:
def check_string(the_string):
    # return True if exactly two of ' ', '-', '/' do not appear in the_string
    counters = [the_string.count(' '), the_string.count('/'), the_string.count('-')]
    return counters.count(0) == 2

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and the space that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the correct target string without any modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
See the regex demo.
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(rx, s)
if m:
    print(m.group(1))
# => I want this string.
See the Python demo.
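The difference between the character class [?] and the literal text "[?]" can be checked directly. The sketch below uses re.escape() to build the literal parts of the pattern, which is an alternative to escaping by hand and not something the answer itself proposes.

```python
import re

s = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'

# As a pattern, '[?]' is a character class that matches a single '?'.
print(re.search('[?]', s).group(0))  # ?

# re.escape() yields a pattern that matches the three characters '[?]' literally.
pattern = re.escape('[?]') + r'\s*(.*?)' + re.escape('Reduced')
m = re.search(pattern, s)
print(m.group(1))  # I want this string.
```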
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved, followed by the index of Reduced. The resulting substring is printed.
Every time this [?]...Reduced pattern is returned, the index is updated to the rest of the string. The search is continued from that index.
Code
s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
idx = s.find('[?]')
# Compare with !=, not "is not": identity checks on integers are unreliable.
while idx != -1:
    start = idx
    end = s.find('Reduced', idx)
    print(s[start+3:end].strip())
    idx = s.find('[?]', end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.
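For comparison, the same three substrings can be collected in one call with re.findall. This is a sketch under the assumption that every [?] is eventually followed by Reduced; the surrounding \s* is added here to trim whitespace instead of calling strip().

```python
import re

s = (' [?] Nice to meet you.Reduced efweww [?] Who are you? '
     'Reduced<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>')

# Literal "[?]", optional whitespace, the shortest possible text, then "Reduced".
quotes = re.findall(r'\[\?\]\s*(.*?)\s*Reduced', s)
print(quotes)
# ['Nice to meet you.', 'Who are you?', 'I want this string.']
```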

how do you do regex in python

I have a string like this:
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
I need to get rid of everything until the first instance of the underline (inclusive) in regex.
I've tried this:
re.sub("(^.*\_)", "", data)
but this gets rid of everything up to the last underscore:
ProcessCpuUsage
I need it to be:
jvmRuntimeModule_ProcessCpuUsage
Use this instead:
# str.find replaces the find function from the string module, which exists only in Python 2
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
result = data[data.find("_")+1:]
print(result)
re.sub("(^.*\_)", "", data)
This makes . match every character in the line. Once it gets to the end and can't match any more characters with ".", it goes to the next token. Oops, that's an underscore! So it backtracks to just before _ProcessCpuUsage, where it can match an underscore, and then completes the match.
You should ask .* to be less greedy. You also do not need to capture the contents, so drop the parens. The backslash does nothing, so drop it. Keep the leading ^ anchor, though (or pass count=1): re.sub() replaces every match, so without it the lazy pattern would also strip jvmRuntimeModule_.
re.sub("^.*?_", "", data)
You have become a victim of greedy matching. The expression matches the longest sequence that it possibly can.
I know there's a way to turn off greedy matching, but I never remember it. Instead there's a trick I use when there's a character I want to stop at. Instead of matching on every character with . I match on every character except the one I want to stop at.
re.sub(r"^[^_]*_", "", data)
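The effect of greedy matching, lazy matching, and the negated character class can be compared side by side. The variable names below are illustrative; the unanchored cases are included to show that re.sub() replaces every match unless it is anchored or limited with count.

```python
import re

data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'

greedy = re.sub(r'^.*_', '', data)          # .* runs to the last underscore
lazy = re.sub(r'^.*?_', '', data)           # .*? stops at the first underscore
negated = re.sub(r'^[^_]*_', '', data)      # [^_]* cannot cross an underscore
unanchored = re.sub(r'.*?_', '', data)      # no anchor: every prefix is removed
once = re.sub(r'.*?_', '', data, count=1)   # no anchor, but only one replacement

print(greedy)      # ProcessCpuUsage
print(lazy)        # jvmRuntimeModule_ProcessCpuUsage
print(negated)     # jvmRuntimeModule_ProcessCpuUsage
print(unanchored)  # ProcessCpuUsage
print(once)        # jvmRuntimeModule_ProcessCpuUsage
```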
This should do:
import re
def get_last_part(d):
    m = re.match('[^_]*_(.*)', d)
    if m:
        return m.group(1)
    else:
        return None
print(get_last_part('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'))
you can use str.index:
>>> data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
>>> data[data.index('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.split
>>> data.split('_',1)[1]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.find:
>>> data[data.find('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Take a look at the documentation for string methods.
Try this regex:
result = re.sub("^.*?_", "", text)
What the regex ^.*?_ does:
^ .. Assert that the position is at the beginning of the string.
.*? .. Match any character that is not a linebreak character,
between zero and unlimited times, as few times as possible.
_ .. Match the character _
Try using split():
s = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
print(s.split('_',1)[1])
Result:
jvmRuntimeModule_ProcessCpuUsage
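The string-method answers differ in what happens when there is no underscore at all: str.index raises ValueError, split('_', 1)[1] raises IndexError, and the str.find slice returns the whole string. A sketch with str.partition makes that case explicit; after_first is a hypothetical name, and returning the input unchanged when the separator is missing is an assumption about the desired behaviour.

```python
def after_first(text, sep='_'):
    # str.partition returns (head, sep, tail); sep and tail are '' when sep is absent.
    head, found, tail = text.partition(sep)
    return tail if found else text


print(after_first('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'))
# jvmRuntimeModule_ProcessCpuUsage
print(after_first('NoUnderscoreHere'))
# NoUnderscoreHere
```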
