Recursively capture patterns in regex - Python

Using the solution from "How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?", I was able to capture the prefix and the values of the desired pattern, which consists of a CAPITALIZED.PREFIX followed by values within angle brackets: < "value1" , "value2", ... >
"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl",
'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR
<"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG
"<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""
However, I run into problems when I have strings like the one above. The desired output would be:
('ORTH.FOO', ['cali.ber,kl','calf','done'])
('ORHT2BAR', ['what so ever >', 'this that mess < up'])
('JOKE', ['whatthe ', 'what'])
I have tried the following, but it only gives me the first tuple. How do I get all the tuples in the desired output?
import re
intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()
names = re.findall('[\'"](.*?)["\']', v)
print(f, names)

Huh silly me. Somehow, I wasn't testing the whole string on my machine ^^;
Anyway, I used this regex and it works: you get the results you were looking for in a list, which I guess is okay. I'm not very good at Python and don't know how to transform this list into an array or tuple:
>>> import re
>>> intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""
>>> results = re.findall('\\n .*?([A-Z0-9\.]*) < *((?:[^>\n]|>")*) *>.*?(?:\\n|$)', intext)
>>> print(results)
[('ORTH.FOO', '"cali.ber,kl", \'calf\', "done"'), ('ORHT2BAR', '"what so ever>", "this that mess < up"'), ('JOKE', '\'whatthe \', "what" ')]
The parentheses indicate the first level elements and the single quotes the second level elements.

Python's re module does not support recursive patterns. Instead, capture the group between the < and > characters with a regular expression and then process it separately.
The shlex module would do nicely here to parse your quoted strings:
import shlex
import re
intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()
parser = shlex.shlex(v, posix=True)
parser.whitespace += ','
names = list(parser)
print(f, names)
output:
ORTH.FOO ['cali.ber,kl', 'calf', 'done']
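To collect every PREFIX < ... > group rather than only the first, one option is to make the pattern itself aware of quoted strings, so that a < or > inside quotes cannot end the match early. The following is a minimal sketch under that assumption, not the code from the answer above. The names QUOTED, LIST_RE and extract_lists are made up for illustration, and the sketch assumes a quoted value never contains its own quote character.

```python
import re

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

# One quoted value: "..." or '...' (hypothetical helper pattern).
QUOTED = r"""(?:"[^"]*"|'[^']*')"""

# PREFIX, then '<', a comma-separated run of quoted values, then '>'.
LIST_RE = re.compile(
    r'([A-Z0-9.]+)\s*<\s*(' + QUOTED + r'(?:\s*,\s*' + QUOTED + r')*)\s*>'
)

def extract_lists(text):
    """Return (prefix, [values]) for every PREFIX < "a", 'b', ... > group."""
    results = []
    for prefix, body in LIST_RE.findall(text):
        # Strip the surrounding quote characters from each value.
        values = [q[1:-1] for q in re.findall(QUOTED, body)]
        results.append((prefix, values))
    return results

for item in extract_lists(intext):
    print(item)
```

Because the body of the group must consist of whole quoted values, entries such as LKEYS.KEYREL.CARG "<20>" are skipped: the prefix there is not followed by an opening angle bracket.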

Related

Struggling with Regex for adjacent letters differing by case

I want to recursively remove adjacent letters in a string that differ only in their case, e.g. if s = AaBbccDd I would want to remove Aa, Bb and Dd but leave cc.
I can do this recursively using lists (see the working version further down).
I think it ought to be possible using a regex, but I am struggling:
with the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe'.
import re
def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
The way that works is:
def test(line):
    result = ''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0: # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?

re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA and AA the same way. Technically it is possible using re.sub, in the following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly listed all possible letter pairs because, as far as I know, there is no way to say "the same letter in the opposite case" using just an re pattern. Maybe another user will be able to provide a more concise re-based solution.

I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe

r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we restrict it to letters with r'([A-Za-z])\1' and re.IGNORECASE, this would still be wrong, because it matches any letter repeated twice, including xx and XX. You don't want to match consecutive identical characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
    # Get all 'xX|Xx' with a list comprehension
    # and join them with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
        # Iterate through the upper/lowercase characters
        # in lock-step
        for s, t in zip(
            string.ascii_lowercase,
            string.ascii_uppercase)])
def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
print(consecutiveLettersNonMatchingCase())
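If a regex is not a hard requirement, the repeated removal can also be done in a single left-to-right pass with a stack. This is a different technique from the answers above, shown only as a sketch; the function name remove_case_pairs is made up, and it assumes plain ASCII letters.

```python
def remove_case_pairs(line):
    """Repeatedly drop adjacent letters that differ only in case."""
    stack = []
    for ch in line:
        # The pair cancels when both are the same letter but not the same case.
        if stack and stack[-1] != ch and stack[-1].lower() == ch.lower():
            stack.pop()
        else:
            stack.append(ch)
    return ''.join(stack)

print(remove_case_pairs('AaBbccDd'))    # cc
print(remove_case_pairs('fffAaaABbe'))  # fffe
```

Popping from the stack exposes the previous character again, so pairs that only become adjacent after an earlier removal (as in 'abBA') are handled without a second pass.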

Parsing text files with "magic" values

Background
I have some large text files used in an automation script for audio tuning. Each line in the text file looks roughly like:
A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]] BANANA # BANANA
The text gets fed to an old command-line program which searches for keywords, and swaps them out. Sample output would be:
A[0] + B[100] - C[0x1000] [[0]] 0 # 0
A[2] + B[200] - C[0x100A] [[2]] 0 # 0
Problem
Sometimes, text files have keywords that are meant to be left untouched (i.e. cases where we don't want "BANANA" substituted). I'd like to modify the text files to use some kind of keyword/delimiter that is unlikely to pop up in normal circumstances, i.e:
A[#1] + B[#2] - C[#3] [[#1]] #1 # #1
Question
Does python's text file parser have any special indexing/escape sequences I could use instead of simple keywords?

Use a regular expression replacement function with a dictionary.
Match everything between brackets (excluding the brackets themselves) and replace it with the value from the dict, keeping the original value if it is not found:
import re
d = {"PINEAPPLE": "20", "CHERRY": "100", "BANANA": "400"}
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
print(re.sub(r"\[([^\[\]]*)\]", lambda m: "[{}]".format(d.get(m.group(1), m.group(1))), s))
prints:
A[400] + B[20] - C[100] [[400]]

You can use re.sub to perform the substitution. This answer creates a list of randomized values to demonstrate; the list can be replaced with the data you are using:
import re
import random
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
new_s = re.sub(r'(?<=\[)[a-zA-Z0-9]+(?=\])', '{}', s)
random_data = [[random.randint(1, 2000) for i in range(4)] for _ in range(10)]
final_results = [new_s.format(*i) for i in random_data]
for command in final_results:
    print(command)
Output:
A[51] + B[134] - C[864] [[1344]]
A[468] + B[1761] - C[1132] [[1927]]
A[1236] + B[34] - C[494] [[1009]]
A[1330] + B[1002] - C[1751] [[1813]]
A[936] + B[567] - C[393] [[560]]
A[1926] + B[936] - C[906] [[1596]]
A[1532] + B[1881] - C[871] [[1766]]
A[506] + B[1505] - C[1096] [[491]]
A[290] + B[1841] - C[664] [[38]]
A[1552] + B[501] - C[500] [[373]]

Just use
\[([^][]+)\]
And replace this with the desired result, e.g. 123.
Broken down, this says
\[ # opening bracket
([^][]+) # capture anything not brackets, 1+ times
\] # closing bracket
See a demo on regex101.com.
For your changed requirements, you could use an OrderedDict:
import re
from collections import OrderedDict
rx = re.compile(r'\[([^][]+)\]')
d = OrderedDict()
def replacer(match):
    item = match.group(1)
    d[item] = 1
    return '[#{}]'.format(list(d.keys()).index(item) + 1)
string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
string = rx.sub(replacer, string)
print(string)
Which yields
A[#1] + B[#2] - C[#3] [[#1]]
The idea here is to put every (potentially) new item in the dict, then look up its index. An OrderedDict remembers insertion order.
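A variation on the same idea, sketched here as an assumption rather than as part of the answer: storing the number itself as the dict value avoids scanning the keys on every match. The names BRACKETED, number_keywords and seen are made up for illustration.

```python
import re

BRACKETED = re.compile(r'\[([^][]+)\]')

def number_keywords(line, seen):
    """Replace each [KEYWORD] with [#n], numbering keywords in first-seen order."""
    def replacer(match):
        # setdefault stores a new number only the first time a keyword appears.
        index = seen.setdefault(match.group(1), len(seen) + 1)
        return '[#{}]'.format(index)
    return BRACKETED.sub(replacer, line)

seen = {}
print(number_keywords("A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]", seen))
print(seen)
```

Passing the same seen dict for every line of a file keeps the numbering consistent across lines.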
For the sake of academic completeness, you could do it all on your own as well:
import re
class Replacer:
    rx = re.compile(r'\[([^][]+)\]')
    keywords = []
    def do_replace(self, match):
        idx = self.lookup(match.group(1))
        return '[#{}]'.format(idx + 1)
    def replace(self, string):
        return self.rx.sub(self.do_replace, string)
    def lookup(self, item):
        for idx, key in enumerate(self.keywords):
            if key == item:
                return idx
        self.keywords.append(item)
        return len(self.keywords)-1
string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
rpl = Replacer()
string = rpl.replace(string)
print(string)

Can also be done using pyparsing.
This parser essentially defines noun to be the uppercase things within square brackets, then defines a sequence of them to be one line of input, as complete.
To replace items identified with other things define a class derived from dict in a suitable way, so that anything not in the class is left unchanged.
>>> import pyparsing as pp
>>> noun = pp.Word(pp.alphas.upper())
>>> between = pp.CharsNotIn('[]')
>>> leftbrackets = pp.OneOrMore('[')
>>> rightbrackets = pp.OneOrMore(']')
>>> stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'
>>> one = between + leftbrackets + noun + rightbrackets
>>> complete = pp.OneOrMore(one)
>>> complete.parseString(stmt)
(['A', '[', 'BANANA', ']', ' + B', '[', 'PINEAPPLE', ']', ' - C', '[', 'CHERRY', ']', ' ', '[', '[', 'BANANA', ']', ']'], {})
>>> class Replace(dict):
...     def __missing__(self, key):
...         return key
...
>>> replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
>>> new = []
>>> for item in complete.parseString(stmt).asList():
...     new.append(replace[item])
...
>>> ''.join(new)
'A[1] + B[2] - C[CHERRY] [[1]]'

I think it's easier, and clearer, using plex. The snag is that it appears to be available only for Python 2. It took me an hour or two to convert enough of it to Python 3 to get this working.
Just three types of tokens to watch for, then a similar number of branches within a while statement.
from plex import *
from io import StringIO
stmt = StringIO('A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]')
lexicon = Lexicon([
    (Rep1(AnyBut('[]')), 'not_brackets'),
    (Str('['), 'left_bracket'),
    (Str(']'), 'right_bracket'),
])
class Replace(dict):
    def __missing__(self, key):
        return key
replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
scanner = Scanner(lexicon, stmt)
new_statement = []
while True:
    token = scanner.read()
    if token[0] is None:
        break
    elif token[0]=='not_brackets':
        # only text outside the brackets' own characters is looked up
        new_statement.append(replace[token[1]])
    else:
        new_statement.append(token[1])
print (''.join(new_statement))
Result:
A[1] + B[2] - C[CHERRY] [[1]]

Splitting contents in textfile

I have a text file that contains the following:
Number1 (E, P) (F, H)
Number2 (A, B) (C, D)
Number3 (I, J) (O, Z)
I know more or less how to read it and how to get its values into my program, but I want to know how to correctly split each line into "Number1", "(E, P)" and "(F, H)". Later, I also want to be able to check in my program whether "Number1" contains "(E, P)" or not.
def read_srg(name):
    filename = name + '.txt'
    fp = open(filename)
    lines = fp.readlines()
    R = {}
    for line in lines:
        ??? = line.split()
    fp.close()
    return R

I think the easiest/most reliable way would be to use a regex:
import re
regex = re.compile(r"([^()]*) (\([^()]*\)) (\([^()]*\))")
with open("myfile.txt") as text:
    for line in text:
        contents = regex.match(line)
        if contents:
            label, g1, g2 = contents.groups()
            # now do something with these values, e. g. add them to a list
Explanation:
([^()]*) # Match any number of characters besides parentheses --> group 1
[ ] # Match a space
(\([^()]*\)) # Match (, then any non-parenthesis characters, then ) --> group 2
[ ] # Match a space
(\([^()]*\)) # Match (, then any non-parenthesis characters, then ) --> group 3

Because of the whitespace within the parentheses, you are better off using a regular expression than just splitting the lines.
Here's your read_srg function, with the regex check integrated:
import re
def read_srg(name):
    with open('%s.txt' % (name, ), 'r') as text:
        matchstring = r'(Number[0-9]+) (\([A-Z,\s]+\)) (\([A-Z,\s]+\))'
        R = {}
        for i, line in enumerate(text):
            match = re.match(matchstring, line)
            if not match:
                print('skipping exception found in line %d: %s' % (i + 1, line))
                continue
            key, v1, v2 = match.groups()
            R[key] = v1, v2
        return R
from pprint import pformat
print(pformat(read_srg('example')))
To read your dictionary and perform checks on keys and values, you can later do something like:
test_dict = read_srg('example')
for key, (v1, v2) in test_dict.items():
    matchstring = ''
    if 'Number1' in key and '(E, P)' in v1:
        matchstring = 'match found: '
    print('%s%s > %s %s' % (matchstring, key, v1, v2))
A big advantage of this approach is that you can also use your regex to check that your file isn't malformed for some reason.
This is why the matching rule is quite strict:
matchstring = r'(Number[0-9]+) (\([A-Z,\s]+\)) (\([A-Z,\s]+\))'
(Number[0-9]+) will match only words made of Number followed by one or more digits
(\([A-Z,\s]+\)) will match only strings enclosed into () which contain capital letters or , or a whitespace
I read in your comment that the format of the file is always the same, so I'm assuming it is procedurally generated.
Still, you might want to check its integrity (or to be sure that your code does not break if at some point the procedure generating the txt file changes its formatting).
Depending how strict you want your sanity check to be, you can push the above even further:
if you know there should never be more than 3 digits after Number, you might change (Number[0-9]+) to (Number[0-9]{1,3}) (which limits the match to 1, 2 or 3 digits)
if you want to be sure the format in parentheses is always two single capital letters split by ", " you can change (\([A-Z,\s]+\)) to (\([A-Z], [A-Z]\))
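Putting both stricter rules together could look like the sketch below. The names STRICT_LINE and parse_line are made up for illustration, and the trailing $ anchor is an extra assumption (nothing may follow the second pair) that is not part of the pattern described above.

```python
import re

# Hypothetical stricter pattern: 1-3 digits, and exactly "X, Y" inside each pair.
STRICT_LINE = re.compile(
    r'(Number[0-9]{1,3}) (\([A-Z], [A-Z]\)) (\([A-Z], [A-Z]\))$'
)

def parse_line(line):
    """Return (key, v1, v2) for a well-formed line, or None otherwise."""
    match = STRICT_LINE.match(line.strip())
    return match.groups() if match else None

print(parse_line('Number1 (E, P) (F, H)\n'))   # ('Number1', '(E, P)', '(F, H)')
print(parse_line('Number1234 (E, P) (F, H)'))  # None
print(parse_line('Number2 (EP) (F, H)'))       # None
```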

You were really close. Try this:
def read_srg(name):
    with open(name + '.txt', 'r') as f:
        R = {}
        for line in f:
            line = line.replace(', ', ',') # Number1 (E, P) (F, H) -> Number1 (E,P) (F,H)
            header, *contents = line.strip().split() # `header` gets the first item of the list and all the rest go to `contents`
            R[header] = contents
    return R
Checking for membership can be later done like so:
if "(E,P)" in R["Number1"]:
    # do stuff
I did not test this but it should be fine. Let me know if anything comes up.

Python Regular Expressions; Having 0 match either 0 or 1

I have a shorter string s I'm trying to match to a longer string s1. 1's match 1's, but 0's will match either a 0 or a 1.
For instance:
s = '11111' would match s1 = '11111'
s = '11010' would match s1 = '11111' or '11011' or '11110' or '11010'
I know regular expressions would make this much easier but am confused on where to start.

Replace each instance of 0 with [01] to enable it matching either 0 or 1:
import re
s = '11010'
pattern = s.replace('0', '[01]')
regex = re.compile(pattern)
regex.match('11111')
regex.match('11011')

It looks to me like you're actually looking for bitwise arithmetic:
s = '11010'
n = int(s, 2)
for r in ('11111', '11011', '11110', '11010'):
    if int(r, 2) & n == n:
        print(r, 'matches', s)
    else:
        print(r, 'doesnt match', s)

import re
def matches(pat, s):
    p = re.compile(pat.replace('0', '[01]') + '$')
    return p.match(s) is not None
print(matches('11111', '11111'))
print(matches('11111', '11011'))
print(matches('11010', '11111'))
print(matches('11010', '11011'))
You say "match to a longer string s1", but you don't say whether you'd like to match the start of the string, or the end etc. Until I better understand your requirements, this performs an exact match.
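If the template may appear anywhere inside a longer string, the same substitution works with a search instead of an exact match. The sketch below shows both variants side by side; the helper names to_regex, matches_exactly and find_in are made up, and the template is assumed to contain only the characters 0 and 1.

```python
import re

def to_regex(template):
    """Turn a template such as '11010' into a pattern where 0 matches 0 or 1."""
    return re.compile(template.replace('0', '[01]'))

def matches_exactly(template, candidate):
    # fullmatch requires the whole candidate to fit the template.
    return to_regex(template).fullmatch(candidate) is not None

def find_in(template, longer):
    # search returns the first position where the template fits, or -1.
    match = to_regex(template).search(longer)
    return match.start() if match else -1

print(matches_exactly('11010', '11111'))  # True
print(matches_exactly('11111', '11011'))  # False
print(find_in('11010', '0011011'))        # 2
```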

How to extract the substring between two markers?

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and after (ZZZ) the part I am interested in (1234).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?

Using regular expressions - documentation for further reference
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
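One caveat with the slicing above: str.find returns -1 when a marker is missing, so the slice then silently returns unrelated text instead of failing. A minimal sketch of a helper that checks for this (the name between and the choice of returning None are assumptions made for illustration):

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None          # start marker missing
    i += len(start)
    j = text.find(end, i)    # only look for `end` after the start marker
    if j == -1:
        return None          # end marker missing
    return text[i:j]

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between('no markers here', 'AAA', 'ZZZ'))       # None
```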

regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if either "AAA" or "ZZZ" don't exist in your_text.
PS Python Challenge?

Surprised that nobody has mentioned this which is my quick version for one-off scripts:
>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'

You can do it in just one line of code, assuming the part you want consists only of digits:
>>> import re
>>> re.findall(r'\d{1,5}','gfgfdAAA1234ZZZuijjk')
['1234']
The result is a list.

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

You can use re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
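Note that the greedy (.+) used here and the lazy (.+?) used in other answers only give the same result while each marker occurs once. The sketch below illustrates the difference on a made-up string with two marked sections; the helper name all_between is hypothetical, and re.escape is used so the markers are treated as literal text.

```python
import re

def all_between(text, start, end, greedy=False):
    """Find every section between the markers, lazily by default."""
    quantifier = '.+' if greedy else '.+?'
    pattern = re.escape(start) + '(' + quantifier + ')' + re.escape(end)
    return re.findall(pattern, text)

s = 'xAAA12ZZZyAAA34ZZZz'
print(all_between(s, 'AAA', 'ZZZ'))               # ['12', '34']
print(all_between(s, 'AAA', 'ZZZ', greedy=True))  # ['12ZZZyAAA34']
```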

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
print(text[text.index(left)+len(left):text.index(right)])
Gives
string

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')

With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, a capturing group is written as \(..\), but in Python it is written as (..).

You can find first substring with this function in your code (by character index). Also, you can find what is after a substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset == None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1
# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))
# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))

Using PyParsing
import pyparsing as pp
word = pp.Word(pp.alphanums)
s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)
which yields:
[['1234']]

One liner with Python 3.8 if text is guaranteed to contain the substring:
text[text.find(start:='AAA')+len(start):text.find('ZZZ')]

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
I.e. you need to escape the parentheses with a backslash \. This is more a question about regular expressions than about Python, though.
Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix, you need to escape backslashes as you would in C.
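A small sketch of what the r prefix changes, using the sentence from this answer as sample input (the variable names are made up for illustration):

```python
import re

line = 'US president (Barack Obama) met with ...'

# With the r prefix, backslashes reach the regex engine unchanged.
raw_pattern = r'\((.*?)\)'
# Without it, every backslash has to be doubled to get the same string.
escaped_pattern = '\\((.*?)\\)'
print(raw_pattern == escaped_pattern)  # True

match = re.search(raw_pattern, line)
inside = match.group(1) if match else None
print(inside)  # Barack Obama
```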

Also, you can find all combinations with the functions below:
s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text,word):
    word_places = []
    i=0
    while True:
        word_place = text.find(word,i)
        if word_place<0:
            break
        word_places.append(word_place)
        # continue searching right after this occurrence
        i=word_place+len(word)
    return word_places
def find_all_combination(text,start,end):
    start_places = find_all_places(text,start)
    end_places = find_all_places(text,end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place>=end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
print(find_all_combination(s,"Part","Part"))
result:
['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']

In case you want to look for multiple occurrences:
content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )
Or more concisely:
strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.
def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :
string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []
for char in string:
    if char in numbersList: output.append(char)
print(f"output: {''.join(output)}")
### output: 1234

Typescript. Gets string in between two other strings.
Searches shortest string between prefixes and postfixes
prefixes - string / array of strings / null (means search from the start).
postfixes - string / array of strings / null (means search until the end).
public getStringInBetween(str: string, prefixes: string | string[] | null,
postfixes: string | string[] | null): string {
if (typeof prefixes === 'string') {
prefixes = [prefixes];
}
if (typeof postfixes === 'string') {
postfixes = [postfixes];
}
if (!str || str.length < 1) {
throw new Error(str + ' should contain ' + prefixes);
}
let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);
let value = str.substring(start.pos + start.sub.length, end.pos);
if (!value || value.length < 1) {
throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
}
while (true) {
try {
start = this.indexOf(value, prefixes);
} catch (e) {
break;
}
value = value.substring(start.pos + start.sub.length);
if (!value || value.length < 1) {
throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
}
}
return value;
}

a simple approach could be the following:
string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]

One-liners that return another string if there was no match.
Edit: the improved version uses the next function; replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method, less optimal, runs a regex a second time; I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
