Find all strings in nested brackets - python

How do I find the strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all strings in brackets (so wh, x, yz).
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz
Can I modify this regex to find all matches?

To get all properly parenthesized text:
import re

def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)',  # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '', text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result due to the unbalanced paren.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']

Would a string split work instead of a regex?
s='uv(wh(x(yz))'
match=[''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>>print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
You can pop off the first element: if the string starts with a parenthesis, the first element is an empty string, which you don't want; if it isn't empty, it wasn't inside parentheses, so again you don't want it.
Since that wasn't flexible enough, something like this would work:
import re

def match(string):
    # each tuple holds a word that follows '(' or a word that precedes ')'
    unrefined_match = re.findall(r'\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']

Using regex, a pattern such as this might work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
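Your current pattern requires the closing \) to come right after the word characters, so it can only match the innermost group, (yz). The pattern above drops that requirement and captures the word characters after each opening parenthesis.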

Well, if you know how to convert a PHP regex to Python, then you can use this:
\(((?>[^()]+)|(?R))*\)
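Python's built-in re module does not support the recursive (?R) construct used above (the third-party regex module does). As a rough alternative that needs no recursion, here is a minimal stack-based sketch; it is not the pattern above, and the function name find_parenthesized is made up for illustration. It collects the direct text of each balanced group, innermost first, and silently drops groups that are never closed.

```python
def find_parenthesized(text):
    """Return the direct text of every balanced (...) group, innermost first."""
    stack = []   # one list of characters per currently open parenthesis
    found = []
    for ch in text:
        if ch == "(":
            stack.append([])
        elif ch == ")":
            if stack:  # ignore a stray ')' with no matching '('
                found.append("".join(stack.pop()))
        elif stack:
            stack[-1].append(ch)  # character belongs to the innermost open group
    return found


print(find_parenthesized("uuv(wh(x(yz)))"))
print(find_parenthesized("a(b(c)de)"))
```

For the inputs used earlier in this thread it gives the same results as get_all_in_parens, including the missing 'wh' when the parentheses are unbalanced.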

Related

Split a string after multiple delimiters and include it

Hello, I'm trying to split a string without removing the delimiter, and it can have multiple delimiters.
The delimiters can be 'D', 'M' or 'Y'
For example:
>>> string = '1D5Y4D2M'
>>> re.split(someregex, string)  # should ideally return
['1D', '5Y', '4D', '2M']
To keep the delimiter, I use the approach from "Python split() without removing the delimiter":
>>> re.split('([^D]+D)', '1D5Y4D2M')
['', '1D', '', '5Y4D', '2M']
For multiple delimiters, I use the approach from "In Python, how do I split a string and keep the separators?":
>>> re.split('(D|M|Y)', '1D5Y4D2M')
['1', 'D', '5', 'Y', '4', 'D', '2', 'M', '']
Combining both doesn't quite make it.
>>> re.split('([^D]+D|[^M]+M|[^Y]+Y)', string)
['', '1D', '', '5Y4D', '', '2M', '']
Any ideas?
I'd use findall() in your case. How about:
re.findall(r'\d+[DYM]', string)
Which will result in:
['1D', '5Y', '4D', '2M']
(?<=(?:D|Y|M))
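You need to split on a zero-width assertion. This can be done using the regex module in Python; since Python 3.7, the built-in re.split() also splits on empty matches.
See the demo: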
https://regex101.com/r/aKV13g/1
You can split at the locations right after D, Y or M but not at the end of the string with
re.split(r'(?<=[DYM])(?!$)', text)
See the regex demo. Details:
(?<=[DYM]) - a positive lookbehind that matches a location that is immediately preceded with D or Y or M
(?!$) - a negative lookahead that fails the match if the current position is the string end position.
Note
In the current scenario, (?<=[DYM]) can be used instead of a more verbose (?<=D|Y|M) since all alternatives are single characters. If you have multichar delimiters, you would have to use a non-capturing group, (?:...), with lookbehind alternatives inside it. For example, to separate right after Y, DX and MZB you would use (?:(?<=Y)|(?<=DX)|(?<=MZB)). See Python Regex Engine - "look-behind requires fixed-width pattern" Error
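As a small sketch of the note above, the lookbehind alternatives can be built from a list of delimiters. The helper name split_after and the second sample string are made up for illustration, and the code assumes Python 3.7 or later, where re.split() accepts patterns that match an empty string.

```python
import re


def split_after(text, delimiters):
    """Split text right after each delimiter, keeping the delimiter attached."""
    # One fixed-width lookbehind per delimiter, so multi-character
    # delimiters of different lengths are fine.
    lookbehinds = "|".join(f"(?<={re.escape(d)})" for d in delimiters)
    # (?!$) avoids a split (and a trailing empty string) at the very end.
    return re.split(f"(?:{lookbehinds})(?!$)", text)


print(split_after("1D5Y4D2M", ["D", "Y", "M"]))
print(split_after("1Y2DX3MZB4Y", ["Y", "DX", "MZB"]))
```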
I think it will work fine without regex or split
time complexity O(n)
string = '1D5Y4D2M'
temp = ''
res = []
for x in string:
    if x == 'D':
        temp += 'D'
        res.append(temp)
        temp = ''
    elif x == 'M':
        temp += 'M'
        res.append(temp)
        temp = ''
    elif x == 'Y':
        temp += 'Y'
        res.append(temp)
        temp = ''
    else:
        temp += x
print(res)
Using translate:
string = '1D5Y4D2M'
delimiters = ['D', 'Y', 'M']
# append '*' after every delimiter, then split on it (assumes '*' does not occur in the input)
result = string.translate({ord(c): f'{c}*' for c in delimiters}).strip('*').split('*')
print(result)
# ['1D', '5Y', '4D', '2M']

Drop Duplicate Substrings from String with NO Spaces

Given a Pandas DF column in which each value is a ticker symbol repeated back to back (for example AAPLAAPL or ZMZM), how can I turn it into this:
XOM
ZM
AAPL
SOFI
NKLA
TIGR
Although these strings appear to be at most 4 characters long, I can't rely on that. I want to be able to take a string like ABCDEFGHIJABCDEFGHIJ and turn it into ABCDEFGHIJ in one column calculation, preferably WITHOUT looping/iterating through the rows.
You can use a regex pattern like r'\b(\w+)\1\b' with str.extract, like below:
import pandas as pd

df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w+)\1\b'))
Output:
            0
0         ZOM
1          ZM
2        SOFI
3  ABCDEFGHIJ
4         NaN  # <- from `NOTDUPLICATED`
Explanation:
\b is a word boundary
(\w+) captures a word
\1 refers back to the text captured by the first group, (\w+)
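If values without a repetition should be kept rather than turned into NaN, one option is to substitute instead of extract. The sketch below uses only the standard re module so it runs on its own; the name drop_doubled is hypothetical. In pandas, the same pattern and replacement could presumably be passed to Series.str.replace with regex=True, though that is not shown here.

```python
import re

# Same pattern as above: a whole word made of one chunk immediately repeated.
DOUBLED = re.compile(r'\b(\w+)\1\b')


def drop_doubled(value):
    """Collapse 'XYZXYZ' to 'XYZ'; return any other value unchanged."""
    return DOUBLED.sub(r'\1', value)


for symbol in ['ZOMZOM', 'ZMZM', 'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']:
    print(symbol, '->', drop_doubled(symbol))
```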
An alternative approach, which does involve iteration but also uses regular expressions: evaluate the longest possible substrings first, getting progressively shorter. Use each substring to compile a regex that looks for that substring repeated two or more times. If it finds a match, replace it with a single occurrence of the substring.
This first version does not handle leading or trailing characters that are not part of the repetition.
When it performs a removal, it returns, breaking the loop. Going with the longest substrings first ensures that things like 'AAPLAAPL' leave the double A intact.
import re

def remove_repeated(text):
    for i in range(len(text)):
        substr = text[i:]  # suffixes, longest first
        # re.escape keeps any regex metacharacters in the input literal
        pattern = re.compile(f"({re.escape(substr)}){{2,}}")
        if pattern.search(text):
            return pattern.sub(lambda m: substr, text)
    return text
>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'
If we want to make this more flexible, we can write a helper function that yields all of the substrings of a string, starting with the longest. It returns a generator expression, so we don't generate more substrings than we need.
def substrings(text):
    return (text[i:i+l] for l in range(len(text), 0, -1)
                        for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.
def substrings(text):
    return (text[i:i+l] for l in range(len(text)//2, 0, -1)
                        for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
Now, a little tweak to the original function:
def remove_repeated(text):
    for s in substrings(text):
        # re.escape keeps any regex metacharacters in the input literal
        pattern = re.compile(f"({re.escape(s)}){{2,}}")
        if pattern.search(text):
            return pattern.sub(lambda m: s, text)
    return text
And now:
>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'

How to convert a string to a list if the string has wild characters for a group of characters like [] or {}, ()

I have a string of this sort
s = 'a,s,[c,f],[f,t]'
I want to convert this to a list
S = ['a','s',['c','f'],['f','t']]
I tried using strip()
d = s.strip('][').split(',')
But it is not giving me the desired output:
output = ['a', 's', '[c', 'f]', '[f', 't']
You could use ast.literal_eval(), having first enclosed each element in quotes:
>>> import ast, re
>>> qs = re.sub(r'(\w+)', r'"\1"', s) # add quotes
>>> ast.literal_eval('[' + qs + ']') # enclose in brackets & safely eval
['a', 's', ['c', 'f'], ['f', 't']]
You may need to tweak the regex if your elements can contain non-word characters.
This only works if your input string follows Python expression syntax or is sufficiently close to be mechanically converted to Python syntax (as we did above by adding quotes and brackets). If this assumption does not hold, you might need to look into using a parsing library. (You could also hand-code a recursive descent parser, but that'll probably be more work to do correctly than just using a parsing library.)
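To give an idea of what a hand-coded parser involves, here is a minimal recursive sketch for this specific format only (comma-separated items, square brackets for nesting). The names parse_nested and parse_items are made up for illustration; it does not handle quoting or escaping, and it is not a replacement for a parsing library.

```python
def parse_nested(text):
    """Parse comma-separated items with [...] groups into nested lists."""
    pos = 0  # index of the next unread character

    def parse_items(inside_brackets):
        nonlocal pos
        items, token = [], ""

        def flush():
            # Store the pending token (if any) and start a new one.
            nonlocal token
            if token.strip():
                items.append(token.strip())
            token = ""

        while pos < len(text):
            ch = text[pos]
            pos += 1
            if ch == "[":
                flush()
                items.append(parse_items(True))  # recurse for the nested list
            elif ch == "]":
                if not inside_brackets:
                    raise ValueError(f"unexpected ']' at index {pos - 1}")
                flush()
                return items
            elif ch == ",":
                flush()
            else:
                token += ch
        if inside_brackets:
            raise ValueError("missing closing ']'")
        flush()
        return items

    return parse_items(False)


print(parse_nested('a,s,[c,f],[f,t]'))
```

Unlike the quoting approach, this reports unbalanced brackets with a ValueError instead of a syntax error from the evaluator.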
As an alternative to ast.literal_eval, you can use the json package, with more or less the same restrictions as NPE's answer:
import re
import json
s = 'a,s,[c,f],[f,t]'
qs = re.sub(r'(\w+)', r'"\1"', s) # add quotes
ls = json.loads('[' + qs + ']')
print(ls)
# ['a', 's', ['c', 'f'], ['f', 't']]

How can I avoid those empty strings caused by preceding or trailing whitespaces?

>>> import re
>>> re.split(r'[ "]+', ' a n" "c ')
['', 'a', 'n', 'c', '']
When there is preceding or trailing whitespace, there will be empty strings after splitting.
How can I avoid those empty strings? Thanks.
The empty values are the things between the splits. re.split() is not the right tool for the job.
I recommend matching what you want instead.
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
If you must use split, you could use a list comprehension and filter it directly.
>>> [x for x in re.split(r'[ "]+', ' a n" "c ') if x != '']
['a', 'n', 'c']
That's what re.split is supposed to do. You're asking it to split the string on any runs of whitespace or quotes; if it didn't return an empty string at the start, you wouldn't be able to distinguish that case from the case with no preceding whitespace.
If what you're actually asking for is to find all runs of non-whitespace-or-quote characters, just write that:
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
I like abarnert's solution.
However, you can also call (maybe not the most Pythonic way):
myString.strip()
before your split (or similar).
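A minimal sketch of that strip-then-split idea follows. The helper name split_tokens is hypothetical, and it assumes the separators are exactly spaces and double quotes, as in the question. Note that a bare strip() removes only whitespace, so the quote character is passed explicitly.

```python
import re


def split_tokens(text):
    """Split on runs of spaces/double quotes without empty edge strings."""
    stripped = text.strip(' "')  # drop leading/trailing separators first
    if not stripped:
        return []  # re.split would otherwise return ['']
    return re.split(r'[ "]+', stripped)


print(split_tokens(' a n" "c '))
```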

Regular expression using finder

I'm trying to figure out this expression:
import re
p = re.compile("[I need this]")
for m in p.finditer('foo, I need this, more foo'):
    print(m.start(), m.group())
I need to understand why I'm getting "e" in count 22, and how to rewrite this correctly.
[] denotes a character class. In your case, [I need this] means: match a single character that is one of I, n, e, d, t, h, i, s, or a space. It is equivalent to [Inedthis ]. If you would like to match the whole phrase, omit the brackets. If you want to match the brackets as well, escape them: \[I ... \].
By using [], you are searching for the character class [ Idehinst], that is the set of the characters ' ', 'I', 'd', 'e', 'h', 'i', 'n', 's', 't'.
Using (...) matches whatever regular expression is inside the parentheses and marks the start and end of a group.
If you want to search for the phrase as a group, use (I need this):
>>> import re
>>> p = re.compile("(I need this)")
>>> for m in p.finditer('foo, I need this, more foo'):
...     print(m.start(), m.group())
...
5 I need this
For more information, see 7.2.1. Regular Expression Syntax in the official documentation.
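To see the difference side by side, the following small sketch runs the character class, the plain phrase, and the escaped-bracket version. The variable names and the bracketed sample string are made up for illustration.

```python
import re

text = 'foo, I need this, more foo'

# Character class: every match is ONE character taken from the set.
class_hits = [(m.start(), m.group()) for m in re.finditer(r'[I need this]', text)]

# Plain phrase: a single match covering the whole phrase.
phrase_hits = [(m.start(), m.group()) for m in re.finditer(r'I need this', text)]

# Escaped brackets: matches only when the phrase is literally wrapped in [ ].
bracket_hit = re.search(r'\[I need this\]', 'foo, [I need this], more foo')

print(class_hits)
print(phrase_hits)
print(bracket_hit.start(), bracket_hit.group())
```

The first list contains many single-character matches, including the stray "e" from "more", which is the kind of unexpected hit described in the question.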
