Is there a Python-way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
Seems like this is the most readable way, the alternative is regex)
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
n=2
s='20_231_myString_234'
m=re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m: print m.groups()
or have a nice function:
import re
def nthofchar(s, c, n):
regex=r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c,c,n-1,c,c)
l = ()
m = re.match(regex, s)
if m: l = m.groups()
return l
s='20_231_myString_234'
print nthofchar(s, '_', 2)
Or without regexes, using iterative find:
def nth_split(s, delim, n):
p, c = -1, 0
while c < n:
p = s.index(delim, p + 1)
c += 1
return s[:p], s[p + 1:]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print s1, ":", s2
I like this solution because it works without any actuall regex and can easiely be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # on which occourence you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print (part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() allows you to limit the split after a certain number of strings:
def split_at(s, delim, n):
r = s.split(delim, n)[n]
return s[:-len(r)-len(delim)], r
On my machine, the two good answers by #perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
>>>import re
>>>str= '20_231_myString_234'
>>> occerence = [m.start() for m in re.finditer('_',str)] # this will give you a list of '_' position
>>>occerence
[2, 6, 15]
>>>result = [str[:occerence[1]],str[occerence[1]+1:]] # [str[:6],str[7:]]
>>>result
['20_231', 'myString_234']
It depends what is your pattern for this split. Because if first two elements are always numbers for example, you may build regular expression and use re module. It is able to split your string as well.
I had a larger string to split ever nth character, ended up with the following code:
# Split every 6 spaces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
n_split_groups.append(sep.join(groups[:n]))
groups = groups[n:]
print n_split_groups
Thanks #perreal!
In function form of #AllBlackt's solution
def split_nth(s, sep, n):
n_split_groups = []
groups = s.split(sep)
while len(groups):
n_split_groups.append(sep.join(groups[:n]))
groups = groups[n:]
return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print (split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As #Yuval has noted in his answer, and #jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
s = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231'
second_part = text.split('_', 2)[2] # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.
Related
given a string as shown below,
"[xyx],[abc].[cfd],[abc].[dgr],[abc]"
how to print it like shown below ?
1.[xyz]
2.[cfd]
3.[dgr]
The original string will always maintain the above-mentioned format.
I did not realize you had periods and commas... that adds a bit of trickery. You have to split on the periods too
I would use something like this...
list_to_parse = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
count = 0
for i in list_to_parse.split('.'):
for j in i.split(','):
string = str(count + 1) + "." + j
if string:
count += 1
print(string)
string = None
Another option is split on the left bracket, and then just re-add it with enumerate - then strip commas and periods - this method is also probably a tiny bit faster, as it's not a loop inside a loop
list_to_parse = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
for index, i in enumerate(list.split('[')):
if i:
print(str(index) + ".[" + i.rstrip(',.'))
also strip is really "what characters to remove" not a specific pattern. so you can add any characters you want removed from the right, and it will work through the list until it hits a character it can't remove. there is also lstrip() and strip()
string manipulation can always get tricky, so pay attention. as this will output a blank first object, so index zero isn't printed etc... always practice and learn your needs :D
You can use split() function:
a = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
desired_strings = [i.split(',')[0] for i in a.split('.')]
for i,string in enumerate(desired_strings):
print(f"{i+1}.{string}")
This is just a fun way to solve it:
lst = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
count = 1
var = 1
for char in range(0, len(lst), 6):
if var % 2:
print(f"{count}.{lst[char:char + 5]}")
count += 1
var += 1
output:
1.[xyx]
2.[cfd]
3.[dgr]
explanation : "[" appears in these indexes: 0, 6, 12, etc. var is for skipping the next pair. count is the counting variable.
Here we can squeeze the above code using list comprehension and slicing instead of those flag variables. It's now more Pythonic:
lst = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
lst = [lst[i:i+5] for i in range(0, len(lst), 6)][::2]
res = (f"{i}.{item}" for i, item in enumerate(lst, 1))
print("\n".join(res))
You can use RegEx:
import regex as re
pattern=r"(\[[a-zA-Z]*\])\,\[[a-zA-Z]*\]\.?"
results=re.findall(pattern, '[xyx],[abc].[cfd],[abc].[dgr],[abc]')
print(results)
Using re.findall:
import re
s = "[xyx],[abc].[cfd],[abc].[dgr],[abc]"
print('\n'.join(f'{i+1}.{x}' for i,x in
enumerate(re.findall(r'(\[[^]]+\])(?=,)', s))))
Output:
1.[xyx]
2.[cfd]
3.[dgr]
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case e.g. if s = AaBbccDd i would want to be able to remove Aa Bb Dd but leave cc.
I can do this recursively using lists:
I think it aught to be able to be done using regex but i am struggling:
with test string 'fffAaaABbe' the answer should be 'fffe' but the regex I am using gives 'fe'
def test(line):
res = re.compile(r'(.)\1{1}', re.IGNORECASE)
#print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
The way that works is:
def test(line):
result =''
chr = list(line)
cnt = 0
i = len(chr) - 1
while i > 0:
if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
cnt += 1
chr.pop(i)
chr.pop(i - 1)
i -= 2
else:
i -= 1
if cnt > 0: # until we can't find any duplicates.
return test(''.join(chr))
result = ''.join(chr)
print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not way to solve this problem, as it will treat aa, Aa, aA, AA same way. Technically it is possible using re.sub, following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly defined all possible letters pairs, because so far I know there is no way to say "inverted case letter" using just re pattern. Maybe other user will be able to provide more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
res = []
for k, g in groupby(line, key=lambda x: x.lower()):
g = list(g)
if all(x == x.lower() for x in g):
res.append(''.join(g))
print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we just do r'[A-z]\1{1}', this would still be bad because you would match any sequence of the same letter twice, but it would catch xx and XX -- you don't want to match consecutive same characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on #Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
# Get all 'xX|Xx' with a list comprehension
# and join them with '|'
return '|'.join(['{0}{1}|{1}{0}'.format(s, t)\
# Iterate through the upper/lowercase characters
# in lock-step
for s, t in zip(
string.ascii_lowercase,
string.ascii_uppercase)])
def test(line):
res = re.compile(consecutiveLettersNonMatchingCase())
print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
print(consecutiveLettersNonMatchingCase())
I have a string s, a pattern p and a replacement r, i need to get the list of strings in which only one match with p has been replaced with r.
Example:
s = 'AbcAbAcc'
p = 'A'
r = '_'
// Output:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
I have tried with re.finditer(p, s) but i couldn't figure out how to replace each match with r.
You can replace them manually after finding all the matches:
[s[:m.start()] + r + s[m.end():] for m in re.finditer(p,s)]
The result is:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
How does it work?
re.finditer(p,s) will find all matches (each will be a re.Match
object)
the re.Match objects have start() and end() method which return the location of the match
you can replace the part of string between begin and end using this code: s[:begin] + replacement + s[end:]
You don't need regex for this, it's as simple as
[s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
Full code: See it working here
s = 'AbcAbAcc'
p = 'A'
r = '_'
x = [s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
print(x)
Outputs:
['_bcAbAcc', 'Abc_bAcc', 'AbcAb_cc']
As mentioned, this only works on one character, for anything longer than one character or requiring a regex, use zvone's answer.
For a performance comparison between mine and zvone's answer (plus a third method of doing this without regex), see here or test it yourself with the code below:
import timeit,re
s = 'AbcAbAcc'
p = 'A'
r = '_'
def x1():
return [s[:i]+r+s[i+1:] for i,c in enumerate(s) if c==p]
def x2():
return [s[:i]+r+s[i+1:] for i in range(len(s)) if s[i]==p]
def x3():
return [s[:m.start()] + r + s[m.end():] for m in re.finditer(p,s)]
print(x1())
print(timeit.timeit(x1, number=100000))
print(x2())
print(timeit.timeit(x2, number=100000))
print(x3())
print(timeit.timeit(x3, number=100000))
This might be more information than necessary to explain my question, but I am trying to combine 2 scripts (I wrote for other uses) together to do the following.
TargetString (input_file) 4FOO 2BAR
Result (output_file) 1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
My first script finds the pattern and copies to file_2
pattern = "\d[A-Za-z]{3}"
matches = re.findall(pattern, input_file.read())
f1.write('\n'.join(matches))
My second script opens the output_file and, using re.sub, replaces and alters the target string(s) using capturing groups and back-references. But I am stuck here on how to turn i.e. 3 into 1 2 3.
Any ideas?
This simple example doesn't need to use regular expression, but if you want to use re anyway, here's example (note: you have minor error in your pattern, should be A-Z, not A-A):
text_input = '4FOO 2BAR'
import re
matches = re.findall(r"(\d)([A-Za-z]{3})", text_input)
for (count, what) in matches:
for i in range(1, int(count)+1):
print(f'{i}{what}', end=' ')
print()
Prints:
1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
Note: If you want to support multiple digits, you can use (\d+) - note the + sign.
Assuming your numbers are between 1 and 9, without regex, you can use a list comprehension with f-strings (Python 3.6+):
L = ['4FOO', '2BAR']
res = [f'{j}{i[1:]}' for i in L for j in range(1, int(i[0])+1)]
['1FOO', '2FOO', '3FOO', '4FOO', '1BAR', '2BAR']
Reading and writing to CSV files are covered elsewhere: read, write.
More generalised, to account for numbers greater than 9, you can use itertools.groupby:
from itertools import groupby
L = ['4FOO', '10BAR']
def make_var(x, int_flag):
return int(''.join(x)) if int_flag else ''.join(x)
vals = ((make_var(b, a) for a, b in groupby(i, str.isdigit)) for i in L)
res = [f'{j}{k}' for num, k in vals for j in range(1, num+1)]
print(res)
['1FOO', '2FOO', '3FOO', '4FOO', '1BAR', '2BAR', '3BAR', '4BAR',
'5BAR', '6BAR', '7BAR', '8BAR', '9BAR', '10BAR']
How can I obtain the number of overlapping regex matches using Python?
I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:
input example string: akka
search pattern: a.*k
A proper function should yield 2 as the number of matches, since there are two possible end positions (k letters).
The pattern might also be more complicated, for example a.*k.*a should also be matched twice in akka (since there are two k's in the middle).
I think that what you're looking for is probably better done with a parsing library like lepl:
>>> from lepl import *
>>> parser = Literal('a') + Any()[:] + Literal('k')
>>> parser.config.no_full_first_match()
>>> list(parser.parse_all('akka'))
[['akk'], ['ak']]
>>> parser = Literal('a') + Any()[:] + Literal('k') + Any()[:] + Literal('a')
>>> list(parser.parse_all('akka'))
[['akka'], ['akka']]
I believe that the length of the output from parser.parse_all is what you're looking for.
Note that you need to use parser.config.no_full_first_match() to avoid errors if your pattern doesn't match the whole string.
Edit: Based on the comment from #Shamanu4, I see you want matching results starting from any position, you can do that as follows:
>>> text = 'bboo'
>>> parser = Literal('b') + Any()[:] + Literal('o')
>>> parser.config.no_full_first_match()
>>> substrings = [text[i:] for i in range(len(text))]
>>> matches = [list(parser.parse_all(substring)) for substring in substrings]
>>> matches = filter(None, matches) # Remove empty matches
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results
>>> matches = list(itertools.chain.from_iterable(matches)) # Flatten results (again)
>>> matches
['bboo', 'bbo', 'boo', 'bo']
Yes, it is ugly and unoptimized but it seems to be working. This is a simple try of all possible but unique variants
def myregex(pattern,text,dir=0):
import re
m = re.search(pattern, text)
if m:
yield m.group(0)
if len(m.group('suffix')):
for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[1:],m.group('end')),1):
yield r
if dir<1 :
for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[:-1],m.group('end')),-1):
yield r
def myprocess(pattern, text):
parts = pattern.split("*")
for i in range(0, len(parts)-1 ):
res=""
for j in range(0, len(parts) ):
if j==0:
res+="(?P<prefix>"
if j==i:
res+=")(?P<suffix>"
res+=parts[j]
if j==i+1:
res+=")(?P<end>"
if j<len(parts)-1:
if j==i:
res+=".*"
else:
res+=".*?"
else:
res+=")"
for r in myregex(res,text):
yield r
def mycount(pattern, text):
return set(myprocess(pattern, text))
test:
>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])