Regex multiple parenthesis and remove one with specific pattern - python

I have strings with multiple parenthesized groups and want to remove every group that contains at least one digit.
I have tried the following. However, the match runs from the first opening parenthesis past groups that should be kept. I have also tried to stop that by excluding an opening parenthesis, but it did not work.
import pandas as pd
names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
data = pd.DataFrame(names, columns=['name'])
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
print(data.name.str.replace(r"\(.*?\d+.*?\)", "", regex=True))
# Output: ['d)', 'd(ab)', 'd']
print(data.name.str.replace(r"\((?!\().*[\d]+(?!\().*\)", "", regex=True))
# Output: ['d(', 'd', 'd']
# desired output: ['d', 'd(ab)', 'd(ab)']

This regex seems to work: \([^)\d]*?\d+[^)]*?\)+
>>> import re
>>> pattern = r'\([^)\d]*?\d+[^)]*?\)+'
>>> names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
>>> [re.sub(pattern, '', x) for x in names]
['d', 'd(ab)', 'd(ab)']
I don't know whether more complex cases exist, but for the ones you've supplied and similar inputs it should do the trick.

Although Python's built-in re module does not support recursive patterns, the third-party regex module does. You can install it with:
pip install regex
Then you can write something like:
import regex
names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
pattern = r'\((?:[^()]*?\d[^()]*?|(?R))+\)'
print([regex.sub(pattern, '', x) for x in names])
Output:
['d', 'd(ab)', 'd(ab)']
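If installing a third-party module is not an option, roughly the same result can be sketched with a plain depth counter instead of a recursive pattern. This is a different technique from the regex above, not a translation of it: the helper name drop_groups_with_digits is made up for illustration, and it assumes the rule is "remove every top-level balanced group that contains a digit anywhere inside it".

```python
def drop_groups_with_digits(text):
    """Remove each top-level balanced (...) group that contains a digit."""
    out = []    # pieces of the result
    depth = 0   # current parenthesis nesting depth
    start = 0   # index where the current top-level group opened
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                group = text[start:i + 1]
                # keep the group only if it has no digit at all
                if not any(c.isdigit() for c in group):
                    out.append(group)
        elif depth == 0:
            # text outside any group (including a stray ')') is kept as-is
            out.append(ch)
    if depth > 0:
        # unbalanced opening parenthesis: keep the remainder unchanged
        out.append(text[start:])
    return ''.join(out)


names = ['d((123))', 'd(1a)(ab)', 'd(1a)(ab)(123)']
print([drop_groups_with_digits(x) for x in names])
# ['d', 'd(ab)', 'd(ab)']
```

Unlike the patterns above, this treats a nested group as part of its outermost group, so it may behave differently on inputs that mix digit and non-digit groups inside one pair of parentheses.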

Related

splitting txt based on ':' but excluding the timestamp in python

05-23 14:14:53.275 A:B:C
In the above case I am trying to split the text on ':' using line.split(':'), and the output should be
['05-23 14:14:53.275','A','B','C']
but instead the output is
['05-23 14','14','53.275','A','B','C']
It also splits the timestamp.
How do I exclude the timestamp from splitting?
You also need to split on the last space. An easy solution is to split on the last space first and then split the second part on ':':
s = '05-23 14:14:53.275 A:B:C'
front, back = s.rsplit(maxsplit=1)
[front] + back.split(':')
# ['05-23 14:14:53.275', 'A', 'B', 'C']
Split the line on whitespace once, starting from the right:
parts = line.rsplit(maxsplit=1)
Combine the first part with the last part split on colons:
parts[:1] + parts[-1].rsplit(":")
['05-23 14:14:53.275', 'A', 'B', 'C']
Just for the fun of using the walrus operator (Python 3.8+):
>>> s = '05-23 14:14:53.275 A:B:C'
>>> [(temp := s.rsplit(maxsplit=1))[0], *temp[1].split(':')]
['05-23 14:14:53.275', 'A', 'B', 'C']
I would suggest you use a regex to split this.
([-:\s\d.]*)\s([\w:]*)
Try it in an online regex tester to see how it splits. Once you get your regex right, you can use the groups to select the part you want and work on that.
import re
# avoid naming variables str or regex: str shadows the built-in type
line = '05-23 14:14:53.275 A:B:C'
pattern = r'([-:\s\d.]*)\s([\w:]*)'
groups = re.compile(pattern).match(line).groups()
timestamp = groups[0]
restofthestring = groups[1]
# Now you can split the second part using split
splits = restofthestring.split(':')
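The regex approach above can be wrapped in a small function that also copes with lines that do not fit the expected shape. This is only a sketch: the names LINE_RE and split_line are invented here, and it assumes a line is always "timestamp, one whitespace character, colon-separated fields".

```python
import re

# Same pattern as in the answer: timestamp characters, one whitespace, then fields
LINE_RE = re.compile(r'([-:\s\d.]*)\s([\w:]*)')


def split_line(line):
    """Return [timestamp, field, field, ...] or None if the line does not fit."""
    m = LINE_RE.fullmatch(line)
    if m is None:
        return None
    timestamp, rest = m.groups()
    return [timestamp, *rest.split(':')]


print(split_line('05-23 14:14:53.275 A:B:C'))
# ['05-23 14:14:53.275', 'A', 'B', 'C']
print(split_line('garbage'))
# None
```

Using fullmatch rather than match means trailing characters that the pattern cannot account for cause a None result instead of being silently ignored.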

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements that contain numbers and all elements that end with a dot.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regexes:
regex = re.compile(r'[a-zA-Z\.]')
regex2 = re.compile(r'[0-9]')
But when I try to combine them, I delete either all elements or nothing.
How can I combine the two regexes correctly?
This should work:
reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it matches any string containing a letter or a dot, not just strings that end with a dot.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also incomplete: it lacks a "+" and a ",", both of which I included in the above solution.
Also, if you assume that elements which end with a dot might contain some numbers, the complete solution should be:
reg = re.compile(r'[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
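As a rough illustration of how that last pattern could be applied to the list from the question, the sketch below filters with re.search. The variable names drop and kept are chosen here for the example only.

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt',
     'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church',
     'every', '10,000', 'persons']

# Matches a dot at the end of the string, or a digit anywhere in it
drop = re.compile(r'\.$|\d')

# Keep only the elements the pattern does not match
kept = [s for s in a if not drop.search(s)]
print(kept)
# ['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination',
#  'national', 'goal', 'one', 'church', 'every', 'persons']
```

search is used rather than match because the digit or the trailing dot may appear anywhere after the first character.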
You could use:
(?:[^\d\n]*\d)|.*\.$
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    # skip elements made only of digits/commas, or ending with a dot
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']

Find all matches between delimiters [duplicate]

I'm trying to find all single letters between ! and !.
For example, the string !abc! should return three matches: a, b, c.
I tried the regex !([a-z])+!, but it returns just one match: c. !(([a-z])+)! also doesn't help.
import re
s = '!abc!'
print(re.findall(r'!([a-z])+!', s))
UPD: Needless to say, it should also work with the strings like !abcdef!. The number of characters between the delimiters is not fixed.
You should place the capture group around the entire repeated term, ([a-z]+). A repeated group such as ([a-z])+ keeps only its last iteration, which is why you got just c. Then you can use list() to convert the match into a list of individual letters.
import re
s = '!abc!'
result = re.findall(r'!([a-z]+)!', s)
print(list(result[0]))
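The same idea extends to strings with several delimited runs. The following is a minimal sketch using only the standard re module; the function name letters_between is made up for the example, and it assumes each run has its own pair of ! delimiters.

```python
import re


def letters_between(text):
    """Return every lowercase letter found inside !...! pairs, in order."""
    # Each findall result is one whole run of letters between two '!'
    runs = re.findall(r'!([a-z]+)!', text)
    # Flatten the runs into individual characters
    return [ch for run in runs for ch in run]


print(letters_between('!abcdef!'))
# ['a', 'b', 'c', 'd', 'e', 'f']
print(letters_between('!abc! and !xyz!'))
# ['a', 'b', 'c', 'x', 'y', 'z']
```

Because each match consumes its closing !, runs that share a delimiter (as in '!ab!cd!') are not all picked up; only the first run is returned in that case.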
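(?<=!.*)\w(?=.*!) should return the result you want, each character individually. Note that it relies on a variable-length lookbehind, which the built-in re module rejects, so it needs the third-party regex module.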
Okay, I'm answering my own question. Found a solution, thanks to this answer.
First off, the third-party regex module is needed, because the pattern below uses the \G anchor and \K, neither of which the built-in re module supports.
Here is the regex:
(?:!|\G(?!^))\K([a-z])(?=(?:[a-z])*!)
Works like a charm.
import regex
s = '!abcdef!'
print(regex.findall(r'(?:!|\G(?!^))\K([a-z])(?=(?:[a-z])*!)', s))
Prints ['a', 'b', 'c', 'd', 'e', 'f'].
I looked into your problem. You can give each character its own capture group:
s = '!abc!'
print(re.findall(r'!([a-z])([a-z])([a-z])!',s))
Each character is captured in its own group, so findall returns them together as one tuple. Note that this only works when there are exactly three letters between the delimiters.

Regex for parameter of a function

I'm fairly inexperienced with regex, but I need one to match the parameter of a function. This function will appear multiple times in the string, and I would like to return a list of all parameters.
The regex must match:
Alphanumeric and underscore
Inside quotes directly inside parenthesis
After a specific function name
Here's an example string:
Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])
and I would like this as output:
['_xyx', 'y', 'z_', 'x_1', 'x', 'y']
What I have so far:
(?<=Atom\(')[\w|_]*
I'm calling this with:
import re
s = "Generic3(p, [Generic3(g, [Atom('x'), Atom('y'), Atom('z')]), Atom('x'), Generic2(f, [Atom('x'), Atom('y')])])"
print(re.match(r"(?<=Atom\(')[\w|_]*", s))
But this just prints None. I feel like I'm nearly there, but I'm missing something, maybe on the Python side to actually return the matches.
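Your regex is close. The main problem is that re.match only tries to match at the start of the string, so it returns None here; use re.findall to collect every occurrence. Also, \w already includes the underscore, so the | and _ inside the character class are unnecessary:
import re
s = "Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])"
pattern = r"(?<=Atom\(')\w+"
final_data = re.findall(pattern, s)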
You can also try this:
import re
s = "Generic3(p, [Generic3(g, [Atom('_xyx'), Atom('y'), Atom('z_')]), Atom('x_1'), Generic2(f, [Atom('x'), Atom('y')])])"
new_data = re.findall(r"Atom\('(.*?)'\)", s)
Output:
['_xyx', 'y', 'z_', 'x_1', 'x', 'y']
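To make the difference between the matching functions concrete, here is a short sketch on a shortened version of the input. It assumes the lookbehind form of the pattern (with the opening quote inside the lookbehind); the variable names are illustrative only.

```python
import re

s = "Generic3(p, [Atom('_xyx'), Atom('z_')])"
pattern = re.compile(r"(?<=Atom\(')\w+")

# match() only tries position 0, where the lookbehind cannot succeed
at_start = pattern.match(s)

# search() scans forward and returns the first occurrence
first = pattern.search(s).group()

# findall() returns every non-overlapping occurrence
every = pattern.findall(s)

print(at_start)  # None
print(first)     # _xyx
print(every)     # ['_xyx', 'z_']
```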

Regular expression in python - help needed

Like many other people posting questions here, I recently started programming in Python.
I'm faced with a problem trying to define the regular expression to extract a variable name (I have a list of variable names saved in the list) from a string.
I am parsing part of the code which I take line by line from a file.
I make a list of variables:
>>> variable_list = ['var1', 'var2', 'var4_more', 'var3', 'var1_more']
What I want to do is to define re.compile with something that won't say that it found two var1; I want to make an exact match. According to the example above, var should match nothing, var1 should match only the first element of the list.
I presume that the answer may be combining regex with negation of other regex, but I am not sure how to solve this problem.
OK, I have noticed that I missed one important thing. Variable list is gathered from a string, so it's possible to have a space before the var name, or sign after.
More accurate variable_list would be something like
>>> variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
In this case it should recognize first 3, but not the last one as a var1.
It sounds like you just need to anchor your regex with ^ and $, unless I'm not understanding you properly:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> matches = [item for item in mylist if r.match(item)]
>>> print(matches)
['var1']
So ^var1$ will match exactly var1, but not var1_text or var1var1. Is that what you're after?
I suppose one way to handle your edit would be with ^\W*var1\W*$ (where var1 is the variable name you want). The \W shorthand character class matches anything that is not in the \w class, and \w in Python is basically alphanumeric characters plus the underscore. The * means that this may be matched zero or more times. This results in:
>>> variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
>>> r = re.compile(r'^\W*var1\W*$')
>>> matches = [item for item in variable_list if r.match(item)]
>>> print(matches)
[' var1;', 'var1 ;', 'var1)']
If you want the name of the variable without the extraneous stuff then you can capture it and extract the first capture group. Something like this, maybe (probably a bit inefficient since the regex runs twice on matched items):
>>> r = re.compile(r'^\W*(var1)\W*$')
>>> matches = [r.match(item).group(1) for item in variable_list if r.match(item)]
>>> print(matches)
['var1', 'var1', 'var1']
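A possible way to avoid running the regex twice, and to build the pattern for any variable name, is sketched below. The helper exact_name_matcher is hypothetical; re.escape is used on the assumption that a name might contain characters that are special in a regex.

```python
import re


def exact_name_matcher(name):
    """Compile a pattern that matches `name` surrounded only by non-word characters."""
    return re.compile(r'^\W*(' + re.escape(name) + r')\W*$')


variable_list = [' var1;', 'var1 ;', 'var1)', 'var1_more']
r = exact_name_matcher('var1')

# map() runs the regex once per item; falsy (None) results are filtered out
matches = [m.group(1) for m in map(r.match, variable_list) if m]
print(matches)
# ['var1', 'var1', 'var1']
```

'var1_more' is rejected because the underscore is a word character, so \W* cannot consume it before the end anchor.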
If you are trying to learn about regular expressions, then maybe this is a useful puzzle, but if you want to see whether a certain word is in a list of words why not this:
>>> 'var1' in mylist
True
>>> 'var1 ' in mylist
False
Not to expand too much more on the regex match, but you might consider using the filter() builtin:
filter(function, iterable)
So, using one of the regexes suggested by @eldarerathis:
>>> mylist = ['var1', 'var2', 'var3_something', 'var1_text', 'var1var1']
>>> import re
>>> r = re.compile(r'^var1$')
>>> list(filter(r.match, mylist))  # filter() returns a lazy iterator in Python 3
['var1']
Or using your own match function:
>>> def matcher(value):
...     return r.match(value) is not None
...
>>> list(filter(matcher, mylist))
['var1']
Or negate the regex earlier with a lambda:
>>> list(filter(lambda x: not r.match(x), mylist))
['var2', 'var3_something', 'var1_text', 'var1var1']
