Python Regular Expression Matching: ## ## - python

I'm searching a file line by line for the occurrence of ##random_string##. It works except for the case of multiple #...
import re
pattern='##(.*?)##'
prog=re.compile(pattern)
string='lala ###hey## there'
result=prog.search(string)
print(re.sub(result.group(1), 'FOUND', string))
Desired Output:
"lala #FOUND there"
Instead I get the following because it's grabbing the whole ###hey##:
"lala FOUND there"
So how would I ignore any number of # at the beginning or end, and only capture "##string##"?

To match at least two hashes at either end:
pattern='##+(.*?)##+'

Your problem is with your inner match. You use ., which matches any character that isn't a line end, and that means it matches # as well. So when it gets ###hey##, it matches (.*?) to #hey.
The easy solution is to exclude the # character from the matchable set:
prog = re.compile(r'##([^#]*)##')
Protip: Use raw strings (e.g. r'') for regular expressions so you don't have to go crazy with backslash escapes.
Trying to allow # inside the hashes will make things much more complicated.
EDIT: If you do not want to allow blank inner text (i.e. "####" shouldn't match with an inner text of ""), then change it to:
prog = re.compile(r'##([^#]+)##')
+ means "one or more."
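As a quick check of the difference, here is a minimal sketch that runs the lazy pattern from the question and the negated-class pattern against the same string. The variable names are only illustrative.

```python
import re

text = 'lala ###hey## there'

# Lazy dot: the match starts at the first '##', so the capture keeps a '#'.
lazy_capture = re.search(r'##(.*?)##', text).group(1)

# Negated class: '#' cannot be part of the capture, so the match
# starts one character later and leaves the extra '#' outside it.
strict = re.compile(r'##([^#]+)##')
strict_capture = strict.search(text).group(1)
replaced = strict.sub('FOUND', text)

print(lazy_capture)    # #hey
print(strict_capture)  # hey
print(replaced)        # lala #FOUND there
```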

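'^#{2,}([^#]*)#{2,}' -- any number of # >= 2 on either end. Note that the leading ^ anchors the match to the start of the string; drop it to find the pattern in the middle of a line.
Be careful with a dot inside the capture, as in (.*?), because . also matches #: on '###hey##' the lazy version captures '#hey', and a greedy (.*) would match all of '##abc#####' and capture 'abc###'. Lazy quantifiers can also be slower than a negated character class.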

Try the "block comment trick": /##((?:[^#]|#[^#])+?)##/
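A small sketch of what this pattern does in Python, assuming the surrounding slashes are just delimiters from another regex dialect and not part of the pattern. It allows a single # inside the delimiters, but on the string from the question it still captures the extra leading #.

```python
import re

# Inner text is built from non-'#' characters, or a '#' followed by a non-'#'.
pattern = re.compile(r'##((?:[^#]|#[^#])+?)##')

plain = pattern.search('x ##hey## y').group(1)
single_hash = pattern.search('x ##a#b## y').group(1)
question_case = pattern.search('lala ###hey## there').group(1)

print(plain)          # hey
print(single_hash)    # a#b
print(question_case)  # #hey
```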

Add + to the regex, which means "match one or more of the preceding character."
import re
pattern='#+(.*?)#+'
prog=re.compile(pattern)
string='###HEY##'
result=prog.search(string)
print(result.group(1))
Output:
HEY

Have you considered doing it the non-regex way?
>>> string='lala ####hey## there'
>>> string.split("####")[1].split("#")[0]
'hey'

>>> import re
>>> text= 'lala ###hey## there'
>>> matcher= re.compile(r"##[^#]+##")
>>> print(matcher.sub("FOUND", text))
lala #FOUND there
>>>

Related

How to pull out language via regex

I have the following two string:
s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'
I want to pull out the first language in each string. Here is what I have so far:
re.search(r'\s\((.+)', s1)
['English)']
How would I improve this to work on both of the above?
You could use this, which will only find the first language; it is only a small tweak to your existing code:
f=re.findall(r'\((\w+)', s1)
e=re.findall(r'\((\w+)', s2)
if f:
    print(f)
if e:
    print(e)
f = ['English']
e = ['English']
If you only want the first language, then you should use search instead, like so:
f = re.search(r'\((\w+)', s1)
e = re.search(r'\((\w+)', s2)
if f:
    print(f.group(1))
if e:
    print(e.group(1))
This will print a string rather than a list, since it only finds one match.
Widen the search to start the phrase with a parenthesis or comma+space, and end with a parenthesis or comma+space:
>>> re.findall(r'\s(?:\(|, )(.+)(?:\)|, )', s2)
['English, French']
The ?: after a parenthesis indicates a non-capturing group.
You can then grab whichever language you're interested in with indexing.
Since the strings you're searching are actually pretty tidy, you can also do this without regex:
>>> s1.split('(')[1].split(')')[0].split(', ')[0]
'English'
>>> s2.split('(')[1].split(')')[0].split(', ')[0]
'English'
You can just use this simple modification of your regular expression:
\s\(([^,\n\)]+)
You're looking for the text after the first left parenthesis and before the first comma or closing parenthesis. So, a regex that would match this is:
\(([^,)]*)
(Your answer will be in group 1)
Finally, I'd like to point you to https://www.debuggex.com/, which will help you easily visualize your regex questions.
Assuming languages are always at the end, surrounded by brackets and listed with ,:
(?<=\()\w+(?=(?:, \w+)*\)$)
The idea is:
(?<=\() - the string should be preceded by an opening bracket, (
\w+ - the language itself is a sequence of letters
(?=(?:, \w+)*\)$) - after it, there can be zero or more other languages, separated by a comma and a space, followed by the closing bracket, ), which leaves us at the end of the string
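For reference, a minimal sketch of the lookaround pattern applied to both strings from the question. The name first_language is made up for the example.

```python
import re

s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'

# Lookbehind for '(', then a word, then a lookahead that requires the rest
# of the language list and the closing ')' at the very end of the string.
first_language = re.compile(r'(?<=\()\w+(?=(?:, \w+)*\)$)')

for s in (s1, s2):
    m = first_language.search(s)
    print(m.group() if m else None)  # English, both times
```

Because nothing is captured, the language is the whole match (m.group()), not group 1.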

how do you do regex in python

I have a string like this:
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
I need to get rid of everything until the first instance of the underline (inclusive) in regex.
I've tried this:
re.sub("(^.*\_)", "", data)
but this gets rid of everything up to the last underscore:
ProcessCpuUsage
I need it to be:
jvmRuntimeModule_ProcessCpuUsage
Use this instead:
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
result = data[data.find("_")+1:]
print(result)
re.sub("(^.*\_)", "", data)
This makes . match every character in the line. Once it gets to the end, and can't match any more ".", it goes to the next token. Oops, that's an underscore! So, it backtracks to just before the _ProcessCpuUsage, where it can match an underscore, and then completes the match.
You should ask the . multiplier to be less greedy. You also do not need to capture the contents. Drop the parens. The backslash does nothing. Drop it. Keep the leading ^ anchor, though: re.sub replaces every match, so without it each "prefix_" chunk would be removed in turn.
re.sub("^.*?_", "", data)
You have become a victim of greedy matching. The expression matches the longest sequence that it possibly can.
I know there's a way to turn off greedy matching, but I never remember it. Instead there's a trick I use when there's a character I want to stop at. Instead of matching on every character with . I match on every character except the one I want to stop at.
re.sub("^[^_]*_", "", data)
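To compare the variants discussed in this thread side by side, here is a rough sketch on the string from the question. The variable names are illustrative; count=1 is the standard re.sub argument that limits the number of replacements.

```python
import re

data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'

greedy = re.sub(r'^.*_', '', data)        # runs to the LAST underscore
lazy = re.sub(r'^.*?_', '', data)         # stops at the first underscore
negated = re.sub(r'^[^_]*_', '', data)    # same result, no lazy quantifier

# Without the anchor, re.sub removes every "prefix_" chunk in turn,
# unless the number of replacements is limited.
unanchored = re.sub(r'.*?_', '', data)
first_only = re.sub(r'.*?_', '', data, count=1)

print(greedy)      # ProcessCpuUsage
print(lazy)        # jvmRuntimeModule_ProcessCpuUsage
print(negated)     # jvmRuntimeModule_ProcessCpuUsage
print(unanchored)  # ProcessCpuUsage
print(first_only)  # jvmRuntimeModule_ProcessCpuUsage
```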
This should do:
import re
def get_last_part(d):
    m = re.match('[^_]*_(.*)', d)
    if m:
        return m.group(1)
    else:
        return None
print(get_last_part('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'))
you can use str.index:
>>> data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
>>> data[data.index('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.split
>>> data.split('_',1)[1]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.find:
>>> data[data.find('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Take a look at the string methods documentation.
Try this regex:
result = re.sub("^.*?_", "", text)
What the regex ^.*?_ does:
^ .. Assert that the position is at the beginning of the string.
.*? .. Match every character that is not a linebreak character, between zero and unlimited times, as few times as possible.
_ .. Match the character _
Try using split():
s = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
print(s.split('_',1)[1])
Result:
jvmRuntimeModule_ProcessCpuUsage
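One thing the string-method answers handle differently is input with no underscore at all: split('_', 1)[1] raises IndexError, str.index raises ValueError, and the str.find slice returns the whole string. If that case matters, str.partition is one way to make the behaviour explicit. The helper name below is hypothetical, and returning the input unchanged is an assumption about what you'd want.

```python
def strip_first_segment(value, sep='_'):
    """Return the text after the first `sep`, or `value` unchanged if `sep` is absent."""
    head, found, tail = value.partition(sep)
    # `found` is the separator itself, or '' when it does not occur.
    return tail if found else value


print(strip_first_segment('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'))
print(strip_first_segment('NoUnderscoreHere'))
```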

python regular expression grouping

My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the sentence doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4#9.5' -> ('120x4#9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or one" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)
Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?
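To see why the attempts in the question behave the way they do, here is a small sketch using re.match on the first example string. The variable names are illustrative. The lazy version succeeds with an empty match because nothing forces it to consume the rest of the string; adding an end anchor (or using the negated class) fixes that.

```python
import re

text = '120x4#Words'

# Greedy first group swallows everything; the optional part matches nothing.
greedy = re.match(r'(.*)(?:#(.*))?', text).groups()

# Lazy first group matches as little as possible: the empty string.
lazy = re.match(r'(.*?)(?:#(.*))?', text).groups()

# Anchoring the end forces the lazy group to grow until the rest fits.
anchored = re.match(r'(.*?)(?:#(.*))?$', text).groups()

# The negated class stops at '#' without needing an anchor.
negated = re.match(r'([^#]+)(?:#(.*))?', text).groups()

print(greedy)    # ('120x4#Words', None)
print(lazy)      # ('', None)
print(anchored)  # ('120x4', 'Words')
print(negated)   # ('120x4', 'Words')
```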
(.*?)#(.*)|(.+)
This should work. See demo:
http://regex101.com/r/oC3nN4/14
use re.split :
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4#9.5'
>>> re.split('#',b)
['120x4', '9.5']
>>>
Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
    \A
    (?P<left>.*?)
    (?:
        [#]
        (?P<right>.*)
    )?
    \Z
''', re.VERBOSE)
def parse(text):
    match = REGEX.match(text)
    if match:
        return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
    return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
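If you want tuples shaped like the ones in the question (one element when there is no '#', two when there is), str.partition is another option. This is only a sketch; split_on_hash is a made-up name.

```python
def split_on_hash(text):
    """Return (left, right) if text contains '#', otherwise (text,)."""
    left, sep, right = text.partition('#')
    # `sep` is '' when there is no '#' in the text.
    return (left, right) if sep else (left,)


print(split_on_hash('120x4#Words'))  # ('120x4', 'Words')
print(split_on_hash('120x4'))        # ('120x4',)
```

Only the first '#' is treated as the separator, so any later ones stay in the right-hand part.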

Match string between special characters

I've messed around with regex a little bit but am pretty unfamiliar with it for the most part. The string will in the format:
\n\n*text here, can be any spaces, etc. etc.*
The string that I will get will have two line breaks, followed by an asterisk, followed by text, and then ending with another asterisk.
I want to exclude the beginning \n\n from the returned text. This is the pattern that I've come up with so far and it seems to work:
pattern = "(?<=\\n\\n)\*(.*)(\*)"
match = re.search(pattern, string)
if match:
    text = match.group()
    print (text)
else:
    print ("Nothing")
I'm wondering if there is a better way to go about matching this pattern or if the way I'm handling it is okay.
Thanks.
You can avoid capturing groups and have the whole match as result using:
pattern = r'(?<=\n\n\*)[^*]*(?=\*)'
Example:
import re
print(re.findall(r'(?<=\n\n\*)[^*]*(?=\*)','\n\n*text here, can be any spaces, etc. etc.*'))
If you want to include the asterisk in the result you can use instead:
pattern = r'(?<=\n\n)\*[^*]*\*'
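A quick sketch of both lookaround patterns against the sample string, plus a plain capturing-group version for comparison. The capturing-group pattern is my own variation, not taken from the answer.

```python
import re

s = "\n\n*text here, can be any spaces, etc. etc.*"

# Lookarounds: the whole match is the text between the asterisks.
inner = re.search(r'(?<=\n\n\*)[^*]*(?=\*)', s).group()

# Only the line breaks are in the lookbehind, so the asterisks are kept.
with_stars = re.search(r'(?<=\n\n)\*[^*]*\*', s).group()

# Capturing-group variation: match the delimiters, read group 1.
captured = re.search(r'\n\n\*([^*]*)\*', s).group(1)

print(inner)       # text here, can be any spaces, etc. etc.
print(with_stars)  # *text here, can be any spaces, etc. etc.*
print(captured)    # text here, can be any spaces, etc. etc.
```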
Regular expressions are overkill in a case like this -- if the delimiters are always static and at the head/tail of the string:
>>> s = "\n\n*text here, can be any spaces, etc. etc.*"
>>> def CheckString(s):
...     if s.startswith("\n\n*") and s.endswith("*"):
...         return s[3:-1]
...     else:
...         return "(nothing)"
...
>>> CheckString(s)
'text here, can be any spaces, etc. etc.'
>>> CheckString("no delimiters")
'(nothing)'
(adjusting the slice indexes as needed -- it wasn't clear to me if you want to keep the leading/trailing '*' characters. If you want to keep them, change the slice to
return s[2:])

Remove duplicate chars using regex?

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -
import re
re.sub("a*", "a", "aaaa") # gives 'a'
What if I want to replace all duplicate chars (i.e. a,z) with that respective char? How do I do this?
import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'
NOTE: I know this remove duplicate approach can be better tackled with a hashtable or some O(n^2) algo, but I want to explore this using regexes
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'
The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.
Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then the entire found portion is replaced with a single occurrence of the found letter.
On a side note...
Your example code for just a is actually buggy:
>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'
You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
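Here is a short sketch of the two fixes on the inputs from the question. As far as I know, the exact output of the 'a*' version also depends on the Python version, since the handling of empty matches in re.sub changed in 3.7, so it is left out here.

```python
import re

# '+' only matches real runs of 'a', never the empty string.
collapsed_a = re.sub(r'a+', 'a', 'aaabbbccc')

# The backreference \1 collapses a run of any repeated lowercase letter.
collapsed_short = re.sub(r'([a-z])\1+', r'\1', 'aabb')
collapsed_long = re.sub(r'([a-z])\1+', r'\1', 'abbccddeeffgg')

print(collapsed_a)      # abbbccc
print(collapsed_short)  # ab
print(collapsed_long)   # abcdefg
```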
In case you are also interested in removing duplicates of non-contiguous occurrences you have to wrap things in a loop, e.g. like this
s="ababacbdefefbcdefde"
while re.search(r'([a-z])(.*)\1', s):
    s= re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print(s) # prints 'abcdef'
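The loop can be wrapped in a function and cross-checked against a non-regex version. This is only a sketch: the function names are made up, the regex version assumes a single line of lowercase letters (. does not match newlines), and the dict.fromkeys version relies on dicts preserving insertion order (Python 3.7+).

```python
import re


def dedupe_with_regex(s):
    """Repeatedly drop a later occurrence of a letter that already appeared."""
    pattern = re.compile(r'([a-z])(.*)\1')
    while pattern.search(s):
        s = pattern.sub(r'\1\2', s)
    return s


def dedupe_ordered(s):
    """Keep the first occurrence of each character, in order."""
    return ''.join(dict.fromkeys(s))


sample = 'ababacbdefefbcdefde'
print(dedupe_with_regex(sample))  # abcdef
print(dedupe_ordered(sample))     # abcdef
```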
A solution covering every kind of character, not just letters:
re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')
gives:
'ab['
