Regex split by parentheses, but not all parentheses - Python

I am trying to split a string on opening and closing parentheses, but I want to exclude those parentheses that have a substring right before them.
In the following example:
a = 'abc (xyz pqr) qwe ew (kjlk asd) ue(aad) kljl'
I want to have a list like:
['abc', 'xyz pqr', 'qwe ew', 'kjlk asd', 'ue(aad)', 'kljl']
So I want to keep ue(aad) intact and not split on (aad).
I have tried:
y = [x.strip() for x in re.split(r"[^ue()][()]", a) if x.strip()]

Try this:
import re
a = 'abc (xyz pqr) qwe ew (kjlk asd) ue(aad) kljl'
y = [x.strip() for x in re.split(r' (\S*\(.*?\))', a) if x != '']
for i in range(len(y)):
    if y[i][0] == '(' and y[i][-1] == ')':
        y[i] = y[i].strip('()')
print(y) # => ['abc', 'xyz pqr', 'qwe ew', 'kjlk asd', 'ue(aad)', 'kljl']
The RegEx (\S*\(.*?\)) matches each parenthesized group together with any non-whitespace text directly before it; the loop then removes the surrounding parentheses from matches that have no such preceding text.

For your example data, you could use capture groups to keep the matched text in the result after splitting. In the pattern, capture non-whitespace chars except parentheses that come directly before or after the part with parentheses.
In the list comprehension, first check for x (re.split yields None for a capture group that did not take part in the match) and then you can test again for x.strip()
Note that this does not take any nested/balanced parentheses into account.
Explanation
([^\s()]+\([^()]*\)) Capture group 1, match 1+ non whitespace chars before matching from (...)
| Or
(\([^()]*\)[^\s()]+) Capture group 2, match 1+ non whitespace chars after matching from (...)
| Or
[()] Match either ( or )
import re
pattern = r"([^\s()]+\([^()]*\))|(\([^()]*\)[^\s()]+)|[()]"
a = 'abc (xyz pqr) qwe ew (kjlk asd) ue(aad) kljl'
y = [x.strip() for x in re.split(pattern, a) if x and x.strip()]
print(y)
Output
['abc', 'xyz pqr', 'qwe ew', 'kjlk asd', 'ue(aad)', 'kljl']

This is a strange way to accomplish this but it works:
import re
a = "abc (xyz pqr) qwe ew (kjlk asd) ue(aad) kljl"
dissub = re.split(r"\)\s", a)
newlist = []
for b in dissub:
    dasplit = re.split(r"\s\(", b)
    for c in dasplit:
        newlist.append(c)
i = 0
while i < len(newlist):
    # re-attach the closing parenthesis that the first split consumed
    dacheck = re.search(r"\(", newlist[i])
    if dacheck:
        newlist[i] += ")"
    i += 1
print(newlist)

Since the keyword in my case is always known, I was thinking of removing every ue(.*?) match, keeping the matches in a list, splitting on parentheses, and then substituting the matches back.
This way I will be able to split nested parentheses.
Something like:
import re
a = "abc (xyz pqr) qwe ew (kjlk asd) ue(aad) kljl"
ues = re.findall(r"ue\(.*?\)", a)
j = re.sub(r"(?<=ue)\(.*?\)", "", a)
y = [x.strip() for x in re.split(r"[()]", j) if x.strip()]
for i in y:
    if "ue" in i:
        print(re.sub("ue", ues.pop(0), i))
    else:
        print(i)
Update:
The parentheses that must be ignored always have a substring attached directly in front of them, as in ue(), so requiring a space before the parenthesis should skip them.
y = [x.strip() for x in re.split(r"[(?<=\s)][()]", a) if x.strip()]
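One caveat about the pattern in the update: inside square brackets, (?<=\s) is not a lookbehind but a set of literal characters, so the pattern matches any one of (, ?, <, =, whitespace or ) followed by a parenthesis. A minimal sketch of what it does on the example string, next to a lookaround variant (the name standalone and that pattern are my own assumption of what was intended, not the author's code), might look like this:

```python
import re

a = 'abc (xyz pqr) qwe ew (kjlk asd) ue(aad) kljl'

# Pattern from the update: the first [...] is a character class, not a lookbehind,
# so only "whitespace followed by a parenthesis" ends up acting as a separator here.
bracketed = [x.strip() for x in re.split(r"[(?<=\s)][()]", a) if x.strip()]
print(bracketed)

# Hypothetical alternative: treat "(...)" as a separator only when it is not
# glued to other non-whitespace text on either side; the capture group keeps
# the text between the parentheses in the result.
standalone = r"(?<!\S)\(([^()]*)\)(?!\S)"
parts = [x.strip() for x in re.split(standalone, a) if x.strip()]
print(parts)
```

With the second variant, ue(aad) is left untouched but stays joined to the text that follows it, because nothing but a space separates the two.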

Related

How to remove everything from a list except words?

I have a list like this:
my_list=["'-\\n'",
"'81\\n'",
"'-\\n'",
"'0913\\n'",
"'Assistant nursing\\n'",
"'0533\\n'",
"'0895 Astronomy\\n'",
"'0533\\n'",
"'Astrophysics\\n'",
"'0532\\n'"]
Is there any way to delete everything from this list except words?
Output:
my_list=['Assistant nursing',
'Astronomy',
'Astrophysics',]
I know, for example, that if I want to remove integers in string form I can do this:
no_integers = [x for x in my_list if not (x.isdigit()
               or x[0] == '-' and x[1:].isdigit())]
but it doesn't work well enough.
The non-regex solution:
You can start by stripping off the characters '-\\n, then keep only the characters that are alphabetic (using str.isalpha) or a white space, then filter out the sub-strings that are empty ''. You may need to strip off the white space characters in the end, whic
>>> list(filter(lambda x: x!='', (''.join(j for j in i.strip('\'-\\\\n') if j.isalpha() or j==' ').strip() for i in my_list)))
['Assistant nursing', 'Astronomy', 'Astrophysics']
If you want to use regex, you can use the pattern '([A-Za-z].*?)\\\\n' with re.findall, then filter out the elements that are empty lists; finally, you can flatten the list.
>>> import re
>>> list(filter(lambda x: x, [re.findall('([A-Za-z].*?)\\\\n', i) for i in my_list]))
[['Assistant nursing'], ['Astronomy'], ['Astrophysics']]
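The flattening step mentioned above is not shown in the session. A minimal sketch of one way to do it, assuming a shortened copy of my_list and using itertools.chain (my choice here, not necessarily what the answer had in mind):

```python
import re
from itertools import chain

# Shortened copy of the list from the question, for illustration only
my_list = ["'-\\n'", "'81\\n'", "'Assistant nursing\\n'",
           "'0895 Astronomy\\n'", "'Astrophysics\\n'", "'0532\\n'"]

# Each findall call returns a list (possibly empty); chaining them flattens
# the result and drops the empty lists in one step.
matches = (re.findall(r"([A-Za-z].*?)\\n", item) for item in my_list)
flat = list(chain.from_iterable(matches))
print(flat)
```

Because empty lists contribute nothing to the chain, the separate filter step is no longer needed.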
With regular expressions:
import re
# my_list is the list from the question above
# remove digits, the literal \n sequences, hyphens and ' symbols
my_new_list = [re.sub(r"\d|\\n|[-']", '', s).strip() for s in my_list]
# remove empty strings
my_new_list = [s for s in my_new_list if s != '']
print(my_new_list)
Output
['Assistant nursing', 'Astronomy', 'Astrophysics']

Removing punctuation from only the beginning and end of each element in a list in python

I'm fairly new to Python (and this community). This question branches off an earlier question that was asked and answered a long time ago.
With a list like:
['hello', '...', 'h3.a', 'ds4,']
Creating a new list x with no punctuation (and deleting empty elements) would be:
x = [''.join(c for c in s if c not in string.punctuation) for s in x]
x = [s for s in x if s]
print(x)
Output:
['hello', 'h3a', 'ds4']
However, how would I be able to remove all punctuation only from the beginning and end of each element? I mean, to instead output this:
['hello', 'h3.a', 'ds4']
In this case, keeping the period in h3.a but removing the comma at the end of ds4,.
You could use regular expressions. re.sub() can replace all matches of a regex with a string.
import re
X = ['hello', '.abcd.efg.', 'h3.a', 'ds4,']
X_rep = [re.sub(r"(^[^\w]+)|([^\w]+$)", "", x) for x in X]
print(X_rep)
# Output: ['hello', 'abcd.efg', 'h3.a', 'ds4']
Explanation of regex:
(^[^\w]+):
    ^: Beginning of string
    [^\w]+: One or more non-word characters
|: The previous expression, or the next expression
([^\w]+$):
    [^\w]+: One or more non-word characters
    $: End of string
import string
x = ['hello', '...', 'h3.a', 'ds4,']
x[0] = [''.join(c for c in s if c not in string.punctuation) for s in x][0]
x[(len(x)-1)] = [''.join(c for c in s if c not in string.punctuation) for s in x][(len(x)-1)]
x = [s for s in x if s]
print(x)
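For comparison, trimming punctuation from both ends of every element (rather than only the first and last elements of the list) can also be done without regex. A minimal sketch using str.strip with string.punctuation; the names stripped and result are illustrative:

```python
import string

x = ['hello', '...', 'h3.a', 'ds4,']

# str.strip with a character set removes those characters from both ends only,
# so punctuation inside an element (the period in 'h3.a') is left alone.
stripped = [s.strip(string.punctuation) for s in x]

# Drop elements that consisted solely of punctuation
result = [s for s in stripped if s]
print(result)
```

Note that string.punctuation covers ASCII punctuation only, so other symbols would need to be added to the character set.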

Regex in re To Match Word with More Than 2 Numbers

I want to use re to find words that have more than two numbers anywhere in the word, so I want to return:
aaabbbccc123
but not:
aaabbbccc12
The only trick is that the numbers should be able to appear anywhere:
aaa1bbb2ccc3 aaa12bbbccc3, etc.
You don't need re for this:
import string
len([x for x in "aaabbbccc123" if x in string.digits]) > 2 # True
len([x for x in "aaabb1bccc2" if x in string.digits]) > 2 # False
len([x for x in "aa1abb2bccc3" if x in string.digits]) > 2 # True
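The same counting idea can be applied to every word in a longer text. A minimal sketch, assuming words are separated by whitespace (the function name and the minimum parameter are illustrative, not from the answer above):

```python
def words_with_many_digits(text, minimum=3):
    """Return whitespace-separated words containing at least `minimum` digits."""
    return [word for word in text.split()
            if sum(ch.isdigit() for ch in word) >= minimum]


sample = "aaabbbccc123 aaabbbccc12 aaa1bbb2ccc3 aaa12bbbccc3"
found = words_with_many_digits(sample)
print(found)
```

The digits may sit anywhere in the word, since only their count is checked.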
You could use re.findall with the pattern \d to find all digits in the string. re.findall returns a list of the digits found; you can then use len to get the length of that list.
I also tested the result in Python.
import re
string = "aaabbbccc123"
resultStringOne = re.findall(r"\d", string)
if len(resultStringOne) > 2:
    print("resultStringOne")
    print(resultStringOne)
string = "aaabbbccc12"
resultStringTwo = re.findall(r"\d", string)
if len(resultStringTwo) > 2:
    print("resultStringTwo")
    print(resultStringTwo)
string = "aaa1bbb2ccc3 aaa12bbbccc3"
resultStringThree = re.findall(r"\d", string)
if len(resultStringThree) > 2:
    print("resultStringThree")
    print(resultStringThree)
Result
resultStringOne
['1', '2', '3']
resultStringThree
['1', '2', '3', '1', '2', '3']
import re
arr = ["aaabbbccc123","aaabbbccc12","aaa1bbb2ccc3 aaa12bbbccc3"]
for x in arr:
    m = re.match(r"(.*\d.*\d.*\d.*)", x)
    if m:
        print(m.group(1))
result
aaabbbccc123
aaa1bbb2ccc3 aaa12bbbccc3
This can be done using a single regex as this:
^(?=(?:\d?[a-zA-Z]){3})(?:(?:[a-zA-Z]{3,})?\d){3}
RegEx Demo
RegEx Details:
^: Start
(?=(?:\d?[a-zA-Z]){3}): Lookahead to assert at least 3 letters
(?:(?:[a-zA-Z]{3,})?\d){3}: Match exactly 3 instances of this group. Inside the group we match an optional run of 3+ letters followed by a single digit.
Code:
>>> import re
>>> reg = re.compile(r'^(?=(?:\d?[a-zA-Z]){3})(?:(?:[a-zA-Z]{3,})?\d){3}')
>>> arr = ['aaabbbccc123', 'aaa1bbb2ccc3', 'aaa12bbbccc3', 'aaabbbccc12', 'aaa3bbbccc1', '123']
>>> for el in arr:
...     print(reg.findall(el))
...
['aaabbbccc123']
['aaa1bbb2ccc3']
['aaa12bbbccc3']
[]
[]
[]
For matching words with two or more consecutive digits, which may be helpful to someone arriving here from a search ...
(?<=\s)[a-zA-Z0-9]*[0-9]{2}.*?\b
The Thân examples in his question are included in a demo of this at regex101, which anyone can modify for a new URL with their own improvements using CTRL-s, no login required.
It's for a string intended for use in re.sub(). It matches aaabbbccc123 so long as there's a leading space. It matches 12a and a12 but not 1a2
This just addresses consecutive digits.
The following, \b.*?\d\d.*?\b fails by catching preceding words. Why?

Python Combining f-string with r-string and curly braces in regex

Given a single word (x), return the possible n-grams that can be found in that word.
You can modify the n-gram value as you like;
it is in the curly braces in the pat variable.
The default n-gram value is 4.
For example, for the word (x):
x = 'abcdef'
The possible 4-grams are:
['abcd', 'bcde', 'cdef']
def ngram_finder(x):
    pat = r'(?=(\S{4}))'
    xx = re.findall(pat, x)
    return xx
The question is:
How can I combine an f-string with an r-string in the regex, given that both use curly braces?
You can use this string to combine the n value into your regexp, using double curly brackets to create a single one in the output:
fr'(?=(\S{{{n}}}))'
The regex needs to have {} to make a quantifier (as you had in your original regex {4}). However f strings use {} to indicate an expression replacement so you need to "escape" the {} required by the regex in the f string. That is done by using {{ and }} which in the output create { and }. So {{{n}}} (where n=4) generates '{' + '4' + '}' = '{4}' as required.
Complete code:
import re
def ngram_finder(x, n):
    pat = fr'(?=(\S{{{n}}}))'
    return re.findall(pat, x)
x = 'abcdef'
print(ngram_finder(x, 4))
print(ngram_finder(x, 5))
Output:
['abcd', 'bcde', 'cdef']
['abcde', 'bcdef']
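To see what the doubled braces produce, the pattern can be printed before it is used. A small sketch; the concat variable is only an illustrative name for an equivalent pattern built without an f-string:

```python
import re

n = 4

# {{ and }} become literal braces; {n} is replaced by the value of n
pat = fr'(?=(\S{{{n}}}))'
print(pat)

# Equivalent pattern assembled by plain concatenation of raw strings
concat = r'(?=(\S{' + str(n) + r'}))'
print(pat == concat)

grams = re.findall(pat, 'abcdef')
print(grams)
```

Printing the pattern is a quick way to check the brace escaping whenever the quantifier comes from a variable.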

Regex: select all groups of two (hashtag) words next to each other

I've got an example string:
#water #atlantic ocean #sea
and I want to use regex to select all groups of two hashtag words next to each other, which would return:
[[['#water']['#atlantic ocean']], [['#atlantic ocean']['#sea']]]
I'm at a loss as to how to do this regex. The closest I've gotten is:
([#][A-Za-z\s]+\s?)
which just yields the following (in python):
>>> regex.findall(string)
[u'#water ', u'#atlantic ocean ', u'#sea']
I've tried putting a {2} at the end but that seems to not match pairs. Any ideas at all on how to achieve this?
To me it feels more intuitive to split on # (or space followed by hash) than to use complicated regex:
import re
expr = "#water #atlantic ocean #sea"
# list() is needed on Python 3, where filter returns an iterator
groups = list(filter(None, re.split(r' ?#', expr)))
# another option is to use a split that doesn't require regex at all:
# groups = list(filter(None, map(str.strip, expr.split("#"))))
res = []
for i, itm in enumerate(groups):
    if i < len(groups) - 1:
        res.append(["#" + itm, "#" + groups[i + 1]])
print(res)  # [['#water', '#atlantic ocean'], ['#atlantic ocean', '#sea']]
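The pairing of neighbours can also be written with zip instead of an index loop. A minimal sketch, assuming Python 3 and that every tag starts with # and runs up to the next # (the names tags and pairs are illustrative):

```python
import re

s = "#water #atlantic ocean #sea"

# Grab each tag up to the next '#', then trim the trailing space
tags = [t.strip() for t in re.findall(r"#[^#]+", s)]

# Pair every tag with its right-hand neighbour
pairs = list(zip(tags, tags[1:]))
print(pairs)
```

zip stops at the shorter sequence, so the last tag is never paired with a missing neighbour.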
You need to use a positive lookahead in order to do an overlapping match.
(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?\s#[A-Za-z]+(?:\s[A-Za-z]+)?))
>>> import re
>>> s = "#water #atlantic ocean #sea"
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?\s#[A-Za-z]+(?:\s[A-Za-z]+)?))', s)
>>> print m
['#water #atlantic ocean', '#atlantic ocean #sea']
OR
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)?)\s(#[A-Za-z]+(?:\s[A-Za-z]+)?))', s)
>>> print m
[('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
Use * instead of ? after the non-capturing groups, if the following words would occur zero or more times.
>>> m = re.findall(r'(?=(#[A-Za-z]+(?:\s[A-Za-z]+)*)\s(#[A-Za-z]+(?:\s[A-Za-z]+)*))', s)
>>> print m
[('#water', '#atlantic ocean'), ('#atlantic ocean', '#sea')]
(#[^#]*)(?=[^#]*(#[^#]*))
Try this. It will give the required groups; grab the captures. Note that each capture keeps its trailing space, so strip it if needed.
import re
x = "#water #atlantic ocean #sea"
print(re.findall(r"(#[^#]*)(?=[^#]*(#[^#]*))", x))
Output: [('#water ', '#atlantic ocean '), ('#atlantic ocean ', '#sea')]
See demo.
http://regex101.com/r/rQ6mK9/36
