Python - string replace between parentheses with wildcards

I am trying to remove some text from a string. What I want to remove could be any of the examples listed below: any combination of uppercase and lowercase letters, followed by any combination of digits or letters at the end, with or without a space in between.
(Disk 1)
(Disk 5)
(Disc2)
(Disk 10)
(Part A)
(Pt B)
(disk a)
(CD 7)
(cD X)
I already have a method to get the beginning "(type":
multi_disk_search = ['(disk', '(disc', '(part', '(pt', '(prt']
if any(mds in fileName.lower() for mds in multi_disk_search):  # https://stackoverflow.com/a/3389611
    for mds in multi_disk_search:
        if mds in fileName.lower():
            print(mds)
            break
That returns (disc for example.
I cannot just split by the parentheses because there could be other tags in other parentheses. Also, there is no specific order to the tags; the one I am searching for is typically last, but often it is not.
I think the solution will require regex, but I'm really lost when it comes to that.
I tried this, but it returns something that doesn't make any sense to me.
regex = re.compile(r"\s*\%s\s*" % (mds), flags=re.I) #https://stackoverflow.com/a/20782251/11214013
regex.split(fileName)
newName = regex
print(newName)
Which returns re.compile('\\s*\\(disc\\s*', re.IGNORECASE)
What are some ways to solve this?

Perhaps something like this:
rx = re.compile(r'''
\(
(?: dis[ck] | p(?:a?r)?t )
[ ]?
(?: [a-z]+ | [0-9]+ )
\)''', re.I | re.X)
This pattern uses only basic regex syntax, except perhaps for the X flag (verbose mode), with which any whitespace character in the pattern is ignored unless it is escaped or inside a character class. Feel free to read the Python manual on the re module. Adding support for CD is left as an exercise.
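For example, applied with re.sub to strip the tag (a quick sketch; the file name here is made up):
>>> rx.sub('', 'My Movie (1999) (Disk 10).mkv')
'My Movie (1999) .mkv'
Note that the space before the tag survives; you may want to match an optional leading space as well.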

>>> import re
>>> def remove_parens(s,multi_disk_search):
...     mds = '|'.join([re.escape(x) for x in multi_disk_search])
...     return re.sub(rf'\((?:{mds})[0-9A-Za-z ]*\)', '', s, 0, re.I)
...
>>> multi_disk_search = ['disk','cd','disc','part','pt']
>>> remove_parens('this is a (disc a) string with (123xyz) parens removed',multi_disk_search)
'this is a  string with (123xyz) parens removed'
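Note that removing the tag leaves the surrounding spaces behind (hence the double space above). If that matters, a possible follow-up clean-up (just a sketch):
>>> re.sub(r'\s{2,}', ' ', 'this is a  string with (123xyz) parens removed')
'this is a string with (123xyz) parens removed'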

Related

Regex to fix (all the matches or none) at the end to one

I'm trying to fix the . at the end to only one in a string. For example,
line = "python...is...fun..."
I have the regex \.*$ in Ruby, to be replaced by a single ., as in this demo, but it doesn't seem to work as expected. I've searched for similar posts, and the closest I've got is this answer in Python, which suggests the following,
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> new_text
'python...is...fun.'
But it fails if I have no . at the end. So I've tried \b\.*$, as seen here, but this fails on the 3rd test, which has some ?'s at the end.
My question is: why does \.*$ not match all the .'s (despite being greedy), and how do I solve the problem correctly?
Expected output:
python...is...fun.
python...is...fun.
python...is...fun??.
You might use an alternation that either matches 2 or more dots, or asserts that the character directly to the left is not, for example, !, ?, or a dot itself.
In the replacement use a single dot.
(?:\.{2,}|(?<!\.))$
Explanation
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
Regex demo | Python demo
For example
import re

strings = [
    "python...is...fun...",
    "python...is...fun",
    "python...is...fun??"
]

for s in strings:
    new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
    print(new_text)
Output
python...is...fun.
python...is...fun.
python...is...fun??.
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
(?:\.{2,}|(?<=[^\s.]))$
Regex demo
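For example, with the lookbehind variant an empty string stays empty (a small sketch reusing the pattern above):
>>> re.sub(r"(?:\.{2,}|(?<=[^\s.]))$", ".", "")
''
>>> re.sub(r"(?:\.{2,}|(?<=[^\s.]))$", ".", "python...is...fun")
'python...is...fun.'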

How do you write a regex in python that finds all word which contain only letters, numbers and underscore?

This is the best I was able to come up with:
b = re.findall(r'\b[a-zA-Z0-9_]\b', 'ahz2gb_ $f heyght78_')
But that doesn't work. Also, note that I'm only interested in regexes at the moment; I can solve the problem the long way.
The expected result is a list containing [ahz2gb_, heyght78_]
There is \w for capturing those characters, and you need to allow more than one character with +:
b = re.findall(r'\b\w+\b', 'ahz2gb_ $f heyght78_')
As + is greedy, you don't really need the \b either:
b = re.findall(r'\w+', 'ahz2gb_ $f heyght78_')
If you need words to be split by white space only (not \b), then you can use look-around:
b = re.findall(r'(?<!\S)\w+(?!\S)', 'ahz2gb_ $f heyght78_')
The (?<! sequence means: look back to see you don't have the pattern that follows (?<! preceding the current matching position in the target string. So in this case (?<!\S) means: there should not be a preceding non-white-space character.
Then (?! is similar, but looking forward (without matching).
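For example (a quick sketch using the sample string from the question):
>>> import re
>>> re.findall(r'\w+', 'ahz2gb_ $f heyght78_')
['ahz2gb_', 'f', 'heyght78_']
>>> re.findall(r'(?<!\S)\w+(?!\S)', 'ahz2gb_ $f heyght78_')
['ahz2gb_', 'heyght78_']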
Simpler to understand would be these regexes:
^[0-9a-zA-Z_]+$ : strictly numbers, alphabets and underscore
^[0-9a-zA-Z_ ]+$ : strictly numbers, alphabets, underscore and spaces
If you need words from the matched lines, then split using space as the delimiter.
You can try python regex online on http://pythex.org/
Sample Run on IDLE
>>> import re
>>> re.findall(r'^[a-zA-Z0-9_ ]+$', 'ahz2gb_ f heyght78_')[0].split(' ')
['ahz2gb_', 'f', 'heyght78_']
EDIT: Given the new requirement of only having words, here is how you can achieve the same.
import re
mylist = 'ahz2gb_ $f heyght78_'.split(' ')
r = re.compile("^[0-9a-zA-Z_]+$")
newlist = list(filter(r.match, mylist))
print(newlist)
Wish I could shorten it!!
Sample Run
========= RESTART: C:/regex.py =========
['ahz2gb_', 'heyght78_']
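For what it's worth, a possibly shorter variant (a sketch along the same lines, using re.fullmatch, available since Python 3.4):
>>> import re
>>> [w for w in 'ahz2gb_ $f heyght78_'.split() if re.fullmatch(r'\w+', w)]
['ahz2gb_', 'heyght78_']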

Is it possible to clean a verbose Python regex before printing it?

The Setup:
Let's say I have the following regex defined in my script. I want to keep the comments there for future me because I'm quite forgetful.
RE_TEST = re.compile(r"""[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
""",
re.VERBOSE)
print(magic_function(RE_TEST)) # returns: "[0-9][A-Z][a-y]z"
The Question:
Does Python (3.4+) have a way to convert that to the simple string "[0-9][A-Z][a-y]z"?
Possible Solutions:
This question ("strip a verbose python regex") seems to be pretty close to what I'm asking for and it was answered. But that was a few years ago, so I'm wondering if a new (preferably built-in) solution has been found.
In addition to the above, there are work-arounds such as using implicit string concatenation and then using the .pattern attribute:
RE_TEST = re.compile(r"[0-9]" # 1 Number
r"[A-Z]" # 1 Uppercase Letter
r"[a-y]" # 1 lowercase, but not z
r"z", # gotta have z...
re.VERBOSE)
print(RE_TEST.pattern) # returns: "[0-9][A-Z][a-y]z"
or just commenting the pattern separately and not compiling it:
# matches pattern "nXxz"
RE_TEST = "[0-9][A-Z][a-y]z"
print(RE_TEST)
But I'd really like to keep the compiled regex the way it is (1st example). Perhaps I'm pulling the regex string from some file, and that file is already using the verbose form.
Background
I'm asking because I want to suggest an edit to the unittest module.
Right now, if you run assertRegex(string, pattern) using a compiled pattern with comments and that assertion fails, then the printed output is somewhat ugly (the below is a dummy regex):
Traceback (most recent call last):
File "verify_yaml.py", line 113, in test_verify_mask_names
self.assertRegex(mask, RE_MASK)
AssertionError: Regex didn't match: '(X[1-9]X[0-9]{2}) # comment\n |(XXX[0-9]{2}) # comment\n |(XXXX[0-9E]) # comment\n |(XXXX[O1-9]) # c
omment\n |(XXX[0-9][0-9]) # comment\n |(XXXX[
1-9]) # comment\n ' not found in 'string'
I'm going to propose that the assertRegex and assertNotRegex methods clean the regex before printing it by either removing the comments and extra whitespace or by printing it differently.
The following tested script includes a function that does a pretty good job converting an xmode regex string to non-xmode:
pcre_detidy(retext)
# Function pcre_detidy to convert xmode regex string to non-xmode.
# Rev: 20160225_1800
import re
def detidy_cb(m):
    if m.group(2): return m.group(2)
    if m.group(3): return m.group(3)
    return ""

def pcre_detidy(retext):
    decomment = re.compile(r"""(?#!py/mx decomment Rev:20160225_1800)
# Discard whitespace, comments and the escapes of escaped spaces and hashes.
( (?: \s+ # Either g1of3 $1: Stuff to discard (3 types). Either ws,
| \#.* # or comments,
| \\(?=[\r\n]|$) # or lone escape at EOL/EOS.
)+ # End one or more from 3 discardables.
) # End $1: Stuff to discard.
| ( [^\[(\s#\\]+ # Or g2of3 $2: Stuff to keep. Either non-[(\s# \\.
| \\[^# Q\r\n] # Or escaped-anything-but: hash, space, Q or EOL.
| \( # Or an open parentheses, optionally
(?:\?\#[^)]*(?:\)|$))? # starting a (?# Comment group).
| \[\^?\]? [^\[\]\\]* # Or Character class. Allow unescaped ] if first char.
(?:\\[^Q][^\[\]\\]*)* # {normal*} Zero or more non-[], non-escaped-Q.
(?: # Begin unrolling loop {((special1|2) normal*)*}.
(?: \[(?::\^?\w+:\])? # Either special1: "[", optional [:POSIX:] char class.
| \\Q [^\\]* # Or special2: \Q..\E literal text. Begin with \Q.
(?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
(?:\\E|$) # \Q..\E literal text ends with \E or EOL.
) [^\[\]\\]* # End special: One of 2 alternatives {(special1|2)}.
(?:\\[^Q][^\[\]\\]*)* # More {normal*} Zero or more non-[], non-escaped-Q.
)* (?:\]|\\?$) # End character class with ']' or EOL (or \\EOL).
| \\Q [^\\]* # Or \Q..\E literal text start delimiter.
(?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
(?:\\E|$) # \Q..\E literal text ends with \E or EOL.
) # End $2: Stuff to keep.
| \\([# ]) # Or g3of3 $6: Escaped-[hash|space], discard the escape.
""", re.VERBOSE | re.MULTILINE)
    return re.sub(decomment, detidy_cb, retext)
test_text = r"""
[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
"""
print(pcre_detidy(test_text))
This function detidies regexes written in pcre-8/pcre2-10 xmode syntax.
It preserves whitespace inside [character classes], (?#comment groups) and \Q...\E literal text spans.
RegexTidy
The above decomment regex is a variant of one I am using in my upcoming, yet-to-be-released RegexTidy application, which will not only detidy a regex as shown above (which is pretty easy to do), but will also go the other way and tidy a regex, i.e. convert it from non-xmode to xmode syntax, adding whitespace indentation to nested groups as well as adding comments (which is harder).
p.s. Before giving this answer a downvote on general principle because it uses a regex longer than a couple lines, please add a comment describing one example which is not handled correctly. Cheers!
Looking through the way sre_parse handles this, there really isn't any point where your verbose regex gets "converted" into a regular one and then parsed. Rather, your verbose regex is being fed directly to the parser, where the presence of the VERBOSE flag makes it ignore unescaped whitespace outside character classes, and from unescaped # to end-of-line if it is not inside a character class or a capture group (which is missing from the docs).
The outcome of parsing your verbose regex there is not "[0-9][A-Z][a-y]z". Rather it is:
[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]
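You can see this for yourself by feeding the verbose pattern straight to the parser (a rough sketch; sre_parse is an undocumented internal module, renamed re._parser and deprecated under its old name in newer Pythons):
>>> import re, sre_parse
>>> verbose = r"""[0-9] # 1 Number
... [A-Z]   # 1 Uppercase Letter
... [a-y]   # 1 lowercase, but not z
... z       # gotta have z...
... """
>>> list(sre_parse.parse(verbose, re.VERBOSE))
[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]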
In order to do a proper job of converting your verbose regex to "[0-9][A-Z][a-y]z", you could parse it yourself, for example with a library like pyparsing. The other answer linked in your question uses a regex, which will generally not duplicate the behavior correctly (specifically, spaces inside character classes and # inside capture groups/character classes). And even just dealing with escaping is not as convenient as with a good parser.

Regex help to match groups

I am trying to write a regex for matching a text file that has multiple lines such as :
* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469
My intention is to match the "0050.56aa.3480" and "Veth1379" and put them in group(1) and group(2) for use later on.
The regex I wrote is :
\*\s*\d{1,}\s*(\d{1,}\.(?:[a-z][a-z]*[0-9]+[a-z0-9]*)\.\d{1,})\s*(?:[a-z][a-z]+)\s*\d{1,}\s*.\s*.\s*((?:[a-z][a-z]*[0-9]+[a-z0-9]*))
But it does not seem to be working when I test at:
http://www.pythonregex.com/
Could someone point out any obvious error I am making here?
Thanks,
~Newbie
Try this:
^\* [0-9]{3} +([0-9]{4}.[0-9a-z]{4}.[0-9a-z]{4}).*(Veth[0-9]{4})$
Debuggex Demo
The first part is in capture group one, the "Veth" code in capture group two.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a list of online testers in the bottom section.
I don't think you need a regex for this:
for line in open('myfile', 'r').readlines():
    fields = line.split()
    print("\n" + fields[1] + "\n" + fields[6])
A very strict version would look something like this:
^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$
Debuggex Demo
Here I assume that:
the columns will be pretty much the same,
your first match group contains a group of decimal digits and two groups of lower-case hex digits,
and the last word can be anything.
A few notes:
\d+ is equivalent to \d{1,} or [0-9]{1,}, but reads better (imo)
use \. to match a literal ., as . would simply match anything
[a-z]{2} is equivalent to [a-z][a-z], but reads better (my opinion, again)
however, you might want to use \w instead to match a word character
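A quick Python sketch of the stricter pattern above (using one of the sample lines from the question; the single-space columns are an assumption):
>>> import re
>>> line = '* 964 0050.56aa.3480 dynamic 200 F F Veth1379'
>>> m = re.match(r'^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$', line)
>>> m.group(1), m.group(2)
('0050.56aa.3480', 'Veth1379')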
This will do it:
reobj = re.compile(r"^.*?([\w]{4}\.[\w]{4}\.[\w]{4}).*?([\w]+)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)
if match:
    group1 = match.group(1)
    group2 = match.group(2)
else:
    result = ""

python's re: find words beginning with "string" in any case

I'm trying to make a regex that will return a list of words that begin with barbar in any case. It must return not the whole word, but only the matching part. For example, from the string
string = u'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
# output is...
>>> ['baRbar', 'BARbar', 'BARBAR']
I've tried such code:
re.compile(ur"([\A\b]*)(barbar)", re.UNICODE | re.IGNORECASE).findall(string)
# it returns...
[(u'', u'baRbar'), (u'', u'barbar'), (u'', u'BARbar'), (u'', u'BARBAR')]
It seems that I misunderstood something. Could you help me, please? It would also be great if you could advise some good tutorials about the re module; it's too hard to understand re from the default Python documentation. Thanks!
The following regex is sufficient for what you want to do (as long as flags are set):
\bbarbar
Example:
>>> s = u'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
>>> re.findall(r'\bbarbar', s, re.IGNORECASE | re.UNICODE)
[u'baRbar', u'BARbar', u'BARBAR']
Here are some comments on your current regex which may clarify why \bbarbar does the job:
[\A\b] - \A is normally the start of string, and \b is word boundary, but inside of a character class \b becomes a backspace and I'm not really sure what \A becomes
[\A\b]* - This is why your regex matched 'semibarbarus': the * means 0 or more, so it doesn't require that portion to match; if you dropped the * and fixed the above problem it would work
([\A\b]*)(barbar) - Multiple groups mean that re.findall() will return a tuple of the groups, rather than just the portion you are interested in
Because you want only the words beginning with barbar, you have to split the string first. So you should do something like this:
def findBarbarus(my_string):
    result = []
    for s in my_string.split(" "):
        result += re.compile(ur"(^barbar)", re.UNICODE | re.IGNORECASE).findall(s)
    return result
The ^ in the regular expression means that the word must begin with barbar.
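Used on the sample string from the question, this should give (an untested sketch, keeping the answer's Python 2 syntax):
>>> string = u'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
>>> findBarbarus(string)
[u'baRbar', u'BARbar', u'BARBAR']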
You could try...
string = 'baRbarus, semibarbarus: qui BARbari sunt, alteres BARBARos non sequuntur!'
l=re.findall(' barbar.+? |^barbar.+? ', string, re.IGNORECASE)
print l
Just for the record: If you use \A inside a character class e.g. r"[\A]", it should be treated like a literal A. However it is silently ignored. The same happens with \B and \Z.
I have reported the bug.
