Regular expression using finder - python

I'm trying to figure out this expression:
import re
p = re.compile("[I need this]")
for m in p.finditer('foo, I need this, more foo'):
    print(m.start(), m.group())
I need to understand why I'm getting "e" in count 22
and re-write this correctly.

[] denotes a character class. In your case, [I need this] means: match a single character that is one of I, n, e, d, t, h, i, s, or a space. It is equivalent to [Inedthis ]. To match the whole phrase, omit the brackets. To match literal brackets as well, escape them: \[I ... \].
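The difference between the three forms can be checked with a short sketch. The variable names below are made up for illustration; only the standard re module is used.

```python
import re

text = 'foo, I need this, more foo'

# Character class: every match is ONE character taken from the set.
class_hits = [(m.start(), m.group()) for m in re.finditer(r'[I need this]', text)]

# No brackets: the whole phrase is matched literally.
phrase_hits = [(m.start(), m.group()) for m in re.finditer(r'I need this', text)]

# Escaped brackets: the brackets themselves are part of the match.
bracket_hits = re.findall(r'\[I need this\]', 'foo, [I need this], bar')

print(class_hits[:3])   # first few single-character matches
print(phrase_hits)
print(bracket_hits)
```

With the character class, the loop reports one match per character, which is why single letters such as "e" show up in the output.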

By using [], you are searching for the character class [ Idehinst], that is, the set of the characters ' ', 'I', 'd', 'e', 'h', 'i', 'n', 's', 't'.
Using (...) matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
To search for the whole phrase as a group, use (I need this):
>>> import re
>>> p = re.compile("(I need this)")
>>> for m in p.finditer('foo, I need this, more foo'):
...     print(m.start(), m.group())
...
5 I need this
For more information, see 7.2.1. Regular Expression Syntax in the official documentation.

Related

Drop Duplicate Substrings from String with NO Spaces

Given a Pandas DF column that looks like this:
...how can I turn it into this:
XOM
ZM
AAPL
SOFI
NKLA
TIGR
Although these strings appear to be at most 4 characters long, I can't rely on that. I want to be able to take a string like ABCDEFGHIJABCDEFGHIJ and turn it into ABCDEFGHIJ in one column calculation, preferably WITHOUT looping/iterating through the rows.

You can use a regex pattern like r'\b(\w+)\1\b' with str.extract, as below:
import pandas as pd
df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w+)\1\b'))
Output:
            0
0         ZOM
1          ZM
2        SOFI
3  ABCDEFGHIJ
4         NaN  # <- from `NOTDUPLICATED`
Explanation:
\b is a word boundary
(\w+) captures one or more word characters as group 1
\1 is a backreference to whatever group 1 captured, so the captured text must appear twice in a row
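To see the backreference at work outside pandas, here is a minimal sketch with the standard re module. The function name collapse_doubled is made up for illustration, and it uses re.sub rather than extract, so a string with no repetition comes back unchanged instead of as NaN.

```python
import re

# Same pattern as above: a word made of one chunk repeated exactly twice.
DOUBLED = re.compile(r'\b(\w+)\1\b')

def collapse_doubled(symbol):
    """Replace 'XYZXYZ' with 'XYZ'; leave other strings untouched."""
    return DOUBLED.sub(r'\1', symbol)

for symbol in ['ZOMZOM', 'ZMZM', 'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']:
    print(symbol, '->', collapse_doubled(symbol))
```

Whether unchanged strings or NaN is the better result for unduplicated rows depends on what the column is used for afterwards.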

An alternative approach involves iteration as well as regular expressions. Evaluate the longest possible substrings first, getting progressively shorter. Use each substring to compile a regex that looks for the substring repeated two or more times. If the regex finds a match, replace it with a single occurrence of the substring.
This version does not handle leading or trailing characters that are not part of the repetition.
When it performs a removal, it returns, which ends the loop. Going with the longest substrings first ensures that things like 'AAPLAAPL' leave the double A intact.
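import re
def remove_repeated(text):
    # Try suffixes of the input, longest first.
    for i in range(len(text)):
        substr = text[i:]
        # re.escape keeps any regex metacharacters in the input literal.
        pattern = re.compile(f"({re.escape(substr)}){{2,}}")
        if pattern.search(text):
            # A callable replacement inserts substr verbatim.
            return pattern.sub(lambda m: substr, text)
    return text
>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'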
To make this more flexible, we can add a helper function that yields all of the substrings of a string, starting with the longest. It returns a generator expression, so we don't generate more substrings than we need.
def substrings(text):
    return (text[i:i+l] for l in range(len(text), 0, -1)
            for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.
def substrings(text):
    return (text[i:i+l] for l in range(len(text)//2, 0, -1)
            for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
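The point of returning a generator is that a caller can stop early. A small sketch of that, assuming the half-length version of the helper (restated here so the block runs on its own) and using itertools.islice to pull only the first few candidates:

```python
from itertools import islice

def substrings(text):
    """Yield substrings of text, longest first, up to half its length."""
    return (text[i:i + length]
            for length in range(len(text) // 2, 0, -1)
            for i in range(len(text) - length + 1))

gen = substrings("AAPLAAPL")

# Only three substrings are built here; the rest are never computed
# unless something keeps consuming the generator.
first_three = list(islice(gen, 3))
print(first_three)

# The generator resumes where it left off.
next_candidate = next(gen)
print(next_candidate)
```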
Now, a little tweak to the original function:
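def remove_repeated(text):
    for s in substrings(text):
        # re.escape keeps any regex metacharacters in the input literal.
        pattern = re.compile(f"({re.escape(s)}){{2,}}")
        if pattern.search(text):
            # A callable replacement inserts s verbatim.
            return pattern.sub(lambda m: s, text)
    return text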
And now:
>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'

How to split a string between all chars and int in Python

For example I want to convert "2pL11H10K" into [2, p, L, 11, H, 10, K]

Use a regular expression.
Example:
import re
your_string = "2pL11H10K"
items = re.findall(r'[A-Za-z]|\d+', your_string)
print(items)
which gives:
['2', 'p', 'L', '11', 'H', '10', 'K']

Regular expressions work well here and there have already been some quality answers given, but you will also need to convert the numbers from str to int. The runs of digits can be identified with a regular expression such as [0-9]+, which matches one or more digits.
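A minimal sketch of that conversion step, assuming the input contains only ASCII letters and digits. The function name split_chars_and_ints is made up for illustration.

```python
import re

def split_chars_and_ints(text):
    """Split into single letters and whole numbers, converting numbers to int."""
    # [0-9]+ grabs each run of digits as one token; letters come one at a time.
    tokens = re.findall(r'[A-Za-z]|[0-9]+', text)
    return [int(tok) if tok.isdigit() else tok for tok in tokens]

print(split_chars_and_ints("2pL11H10K"))
```

Characters outside the two classes (punctuation, spaces) are silently skipped by findall, which may or may not be what you want.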

You can implement the logic for this in a loop, checking whether the previous element is a digit or a character, if you do not wish to import any modules. However, a regex would likely be the most elegant solution.
string = "2pL11H10K"
result = []
for e in string:
    # Extend the previous token while both it and the current character are digits.
    if result and result[-1].isdigit() and e.isdigit():
        result[-1] += e
    else:
        result.append(e)
print(result)  # ['2', 'p', 'L', '11', 'H', '10', 'K']

Find all strings in nested brackets

How do I find strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all the strings in brackets (so wh, x, yz)
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz
Can I modify this regex to find all matches?

To get all properly parenthesized text:
import re
def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)',  # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '', text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result because its opening parenthesis is never closed.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']

Would a string split work instead of a regex?
>>> s = 'uv(wh(x(yz))'
>>> match = [''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>> print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
'uv'
You can pop off the first element: if the string starts with a parenthesis, the first element is empty, and if it doesn't, the first element is text that was outside the parentheses. Either way, you don't want it.
Since that wasn't flexible enough, something like this would work:
import re
def match(string):
    unrefined_match = re.findall(r'\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']

Using regex, a pattern such as this might work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
Your current pattern matches a literal ( and ) around the word characters, so it only matches text that has a closing ) immediately after it, which is just (yz).
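The two patterns can be compared side by side. This is only a sketch on the question's own input; the variable names are made up.

```python
import re

s = "uuv(wh(x(yz))"

# Pattern from the question: needs "(" + word chars + ")" with nothing else between.
original = re.findall(r"(\(\w*?\))", s)

# Suggested pattern: only needs "(" followed by word chars, and captures the word.
suggested = re.findall(r"\((\w{1,})", s)

# Limitation: text that follows a nested group is not picked up.
partial = re.findall(r"\((\w{1,})", "a(b(c)de)")

print(original)
print(suggested)
print(partial)
```

Note that the suggested pattern only looks at what comes right after an opening parenthesis, so it does not check that the parentheses are balanced.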

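Well, if you know how to convert from a PHP regex to Python, then you can use this:
\(((?>[^()]+)|(?R))*\)
Note that Python's built-in re module does not support the recursive (?R) construct, so this pattern needs a regex engine that does.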

How to take out symbols from string with regex

I am trying to extract some useful symbols for me from the strings, using regex and Python 3.4.
For example, I need to extract any lowercase letter + any uppercase letter + any digit. The order is not important.
'adkkeEdkj$4' --> 'aE4'
'4jdkg5UU' --> 'jU4'
Or, maybe, a list of the symbols, e.g.:
'adkkeEdkj$4' --> ['a', 'E', 4]
'4jdkg5UU' --> ['j', 'U', 4]
I know that it's possible to match them using:
r'(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])'
Is it possible to get them using regex?

You can get those values by using capturing groups in the look-aheads you have:
import re
p = re.compile('^(?=[^a-z]*([a-z]))(?=[^A-Z]*([A-Z]))(?=[^0-9]*([0-9]))', re.MULTILINE)
test_str = "adkkeEdkj$4\n4jdkg5UU"
print(re.findall(p, test_str))
The output:
[('a', 'E', '4'), ('j', 'U', '4')]
Note that I have edited the look-aheads to use negated character classes for better performance. The ^ anchor is important here, too: without it, the pattern would also match at later positions in each line.
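For a single string rather than a multi-line block, the same look-aheads can be wrapped in a small helper. This is a sketch only; the name first_of_each is made up, and returning None when a character type is missing is an assumption about the desired behaviour.

```python
import re

# One capturing group per look-ahead: first lowercase, first uppercase, first digit.
PATTERN = re.compile(r'^(?=[^a-z]*([a-z]))(?=[^A-Z]*([A-Z]))(?=[^0-9]*([0-9]))')

def first_of_each(text):
    """Return first lowercase + first uppercase + first digit, or None if any is missing."""
    m = PATTERN.match(text)
    return ''.join(m.groups()) if m else None

print(first_of_each('adkkeEdkj$4'))
print(first_of_each('4jdkg5UU'))
print(first_of_each('abc'))
```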

How can I avoid those empty strings caused by preceding or trailing whitespaces?

>>> import re
>>> re.split(r'[ "]+', ' a n" "c ')
['', 'a', 'n', 'c', '']
When there is preceding or trailing whitespace, there will be empty strings after splitting.
How can I avoid those empty strings? Thanks.

The empty values are the things between the splits. re.split() is not the right tool for the job.
I recommend matching what you want instead.
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
If you must use split, you could use a list comprehension and filter it directly.
>>> [x for x in re.split(r'[ "]+', ' a n" "c ') if x != '']
['a', 'n', 'c']

That's what re.split is supposed to do. You're asking it to split the string on any runs of whitespace or quotes; if it didn't return an empty string at the start, you wouldn't be able to distinguish that case from the case with no preceding whitespace.
If what you're actually asking for is to find all runs of non-whitespace-or-quote characters, just write that:
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']

I like abarnert's solution.
However, you can also call (maybe not the most Pythonic way):
myString.strip()
before your split (or whatever else you do with the string).
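A short sketch of the strip-then-split idea on the question's input. The variable names are made up. Note that strip() with no arguments removes only whitespace; passing ' "' removes leading and trailing quotes as well.

```python
import re

raw = ' a n" "c '

# Strip the leading/trailing whitespace first, then split as before.
parts = re.split(r'[ "]+', raw.strip())
print(parts)

# If the string may also start or end with quotes, strip those too.
quoted = '" a b "'
parts_quoted = re.split(r'[ "]+', quoted.strip(' "'))
print(parts_quoted)
```

One caveat: if the stripped string is empty, re.split still returns a list containing one empty string.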
