python regex sub without order - python

I have the following string "3 0ABC, mNone\n" and I want to remove m, None and \n. The catch is that m, \n and None can be anywhere in the string, in any order. I would appreciate any help.
I can do re.sub('[\nm,]','',string) or re.sub('None','',string), but I don't know how to combine them, especially when the order doesn't matter.

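If you want to remove m, None and \n, you can combine them as alternatives in a single pattern, so the order in which they appear does not matter. You can use this regex:
(m|\n|None)
Inside a raw string, \n is handed to the regex engine, which reads it as a newline; \\n would instead match a literal backslash followed by n and leave the newline in place.
If you use the following code:
import re
p = re.compile(r'(m|\n|None)')
test_str = "3 0ABC, mNone\n"
subst = ""
result = re.sub(p, subst, test_str)
print(repr(result))
# Will show:
'3 0ABC, '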
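To check that the order of the tokens really does not matter, here is a minimal Python 3 sketch. The helper name remove_tokens and the two reordered sample strings are illustrative assumptions, not part of the original answer.

```python
import re

def remove_tokens(text, tokens=("None", "m", "\n")):
    """Remove every occurrence of each token from text (hypothetical helper)."""
    # re.escape keeps tokens literal even if they contain regex metacharacters.
    # Alternation is tried at every position, so token order in text is irrelevant.
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub("", text)

# The original string plus two reorderings of the same pieces (assumed test data).
samples = ["3 0ABC, mNone\n", "None\n3 0ABC, m", "\nm3 0ABC, None"]
results = [remove_tokens(s) for s in samples]
print(results)
```

Note that this removes every m in the string, not only the one next to None, which is what the question asks for but may be too broad for other inputs.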

Related

Python re regex sub letters not surrounded in quotes and not if they match specific word including regex group / match

I need to substitute runs of letters that are not surrounded by quotes and are not the word TODAY with a particular string, part of which includes the match group, e.g.:
import re
import string
s = 'AB+B+" HELLO"+TODAY()/C* 100'
x = re.sub(r'\"[^"]*\"|\bTODAY\b|([A-Z]+)', r'a2num("\g<0>")', s)
print (x)
expected output:
'a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100'
actual output:
'a2num("AB")+a2num("B")+a2num("" HELLO"")+a2num("TODAY")()/a2num("C")* 100'
I am nearly there, but it is not obeying the quote rule or the TODAY rule. I know the string doesn't make any sense, but it's just a harsh test of the regex.
Your regex approach is correct, but you need to pass a function (here a lambda) as the replacement in re.sub, so that only matches where the capture group participated are rewritten:
>>> s = 'AB+B+" HELLO"+TODAY()/C* 100'
>>> rs = re.sub(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b',
... lambda m: 'a2num("' + m.group(1) + '")' if m.group(1) else m.group(), s)
>>> print (rs)
a2num("AB")+a2num("B")+" HELLO"+TODAY()/a2num("C")* 100
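The same idea can be written with a named replacement function, which may be easier to read than the lambda. This is a sketch in Python 3; the names PATTERN and wrap_names are assumptions introduced for illustration.

```python
import re

# Quoted strings and the word TODAY are matched first so they are consumed
# without filling group 1; only bare upper-case names end up in the group.
PATTERN = re.compile(r'"[^"]*"|\bTODAY\b|\b([A-Z]+)\b')

def wrap_names(match):
    """Wrap the match in a2num("...") only when group 1 participated."""
    name = match.group(1)
    if name is None:
        # A quoted string or TODAY matched: return it unchanged.
        return match.group(0)
    return 'a2num("{}")'.format(name)

s = 'AB+B+" HELLO"+TODAY()/C* 100'
converted = PATTERN.sub(wrap_names, s)
print(converted)
```

The order of the alternatives matters here: the branches that should be left alone come before the capturing branch.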

Python how to replace content in the capture group of regex?

-abc1234567-abc.jpg
I wish to remove -abc before .jpg, and get -abc1234567.jpg. I tried re.sub(r'\d(-abc).jpg$', '', string), but it will also replace contents outside of the capture group, and give me -abc123456. Is it possible to only replace the content in the capture group i.e. '-abc'?
One solution is to use positive lookahead as follows.
import re
p = re.compile(r'(-abc)(?=\.jpg)')
test_str = "-abc1234567-abc.jpg"
subst = ""
result = re.sub(p, subst, test_str)
OR
You can use two capture groups as follows.
import re
p = re.compile(r'(-abc)(\.jpg)')
test_str = "-abc1234567-abc.jpg"
subst = r"\2"
result = re.sub(p, subst, test_str)
If you only want to remove -abc from file names that end in .jpg, you could use:
re.sub(r"-abc\.jpg$", ".jpg", string)
To stay as close as possible to your code: you should place '()' around the part you want to keep, not the part you want to remove. Then use \g<NUMBER> to refer to that part in the replacement. So:
re.sub(r'(.*)-abc(\.jpg)$', r'\g<1>\g<2>', string)
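The approaches above can be compared side by side. The following Python 3 sketch runs a lookahead version, a capture-and-reinsert version, and a plain string version on the sample file name; the helper strip_suffix_tag and the variable names are hypothetical.

```python
import re

name = "-abc1234567-abc.jpg"

# 1. Lookahead: match -abc only when .jpg follows at the end of the string.
#    The lookahead consumes nothing, so .jpg stays in place.
with_lookahead = re.sub(r"-abc(?=\.jpg$)", "", name)

# 2. Capture the part to keep and put it back in the replacement.
with_group = re.sub(r"-abc(\.jpg)$", r"\1", name)

# 3. No regex at all (hypothetical helper for fixed tag and extension).
def strip_suffix_tag(filename, tag="-abc", ext=".jpg"):
    suffix = tag + ext
    if filename.endswith(suffix):
        return filename[:-len(suffix)] + ext
    return filename

plain = strip_suffix_tag(name)
print(with_lookahead, with_group, plain)
```

In all three cases the leading -abc is left alone because it is not directly followed by .jpg.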

How can I "divide" words with regular expressions?

I have a sentence in which every token has a / in it. I want to just print what I have before the slash.
What I have now is basic:
text = less/RBR.....
return re.findall(r'\b(\S+)\b', text)
This obviously just prints the text, how do I cut off the words before the /?
This assumes you want all characters before the slash from every word that contains a slash. For example, for the input string match/this but nothing here but another/one you would want the results match and another.
With regex:
import re
my_string = "match/this but nothing here but another/one"
result = re.findall(r"\b(\w*?)/\w*?\b", my_string)
print(result)
Without regex:
result = [word.split("/")[0] for word in my_string.split() if "/" in word]
print(result)
Simple and straight-forward:
rx = r'^[^/]+'
# anchor it to the beginning
# the class says: match everything not a forward slash as many times as possible
In Python this would be:
import re
text = "less/RBR....."
print(re.match(r'[^/]+', text))
As this is a match object, you'd probably like to print the matched text instead, like so:
print(re.match(r'[^/]+', text).group(0))
# less
This should also work
\b([^\s/]+)(?=/)\b
Python Code
p = re.compile(r'\b([^\s/]+)(?=/)\b')
test_str = "less/RBR/...."
print(re.findall(p, test_str))
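For a whole sentence of word/TAG tokens, a short Python 3 sketch could look like the following. The sample sentence is made up for illustration, and the code assumes the word part itself contains no slash.

```python
import re

# Hypothetical tagged sentence; only "less/RBR" appears in the question.
tagged = "less/RBR than/IN five/CD days/NNS"

# Regex: lazily capture non-space characters up to the first slash of each token.
words_regex = re.findall(r"(\S+?)/\S+", tagged)

# Plain string methods: split on whitespace, keep the text before the first slash.
words_split = [token.split("/", 1)[0] for token in tagged.split() if "/" in token]

print(words_regex)
print(words_split)
```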

Python match and replace, what I do wrong?

I have a regular expression that matches some data (it is here), and now I am trying to replace all matched data with a single : character.
test_str = u"THERE IS MY DATA"
p = re.compile(ur'[a-z]+([\n].*?<\/div>[\n ]+<div class="large-3 small-3 columns">[\n ]+)[a-z]+', re.M|re.I|re.SE)
print re.sub(p, r':/1',test_str)
I tried a few other ways, but either nothing is replaced or the whole pattern is replaced rather than only the matched group.
1) It's a backslash issue: a backreference is written \1, not /1.
Use print(re.sub(p, r':\1', test_str)), not print(re.sub(p, r':/1', test_str)).
2) You are replacing the whole match with :\1, which means the entire matched text is replaced with : followed by the first group of the regex.
To change only the part covered by that group, capture the text before and after it in two additional groups and put those back in the replacement.
Also note that re.SE is not a flag in the re module; re.S (DOTALL) is presumably what was intended.
I hope this will fix the issue:
test_str = "THERE IS MY DATA"
p = re.compile(r'([a-z]+)([\n].*?<\/div>[\n ]+<div class="large-3 small-3 columns">[\n ]+)([a-z]+)', re.M|re.I|re.S)
print(re.sub(p, r'\1:\2\3', test_str))
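Because the real input is not shown, here is a small self-contained Python 3 sketch of the same technique on made-up data: only the middle group is replaced, and the surrounding groups are written back. The sample string and the variable names are assumptions.

```python
import re

text = "key=old;other=old"  # hypothetical input

# Groups 1 and 3 capture the context to keep; group 2 is the part to replace.
pattern = re.compile(r"(key=)(\w+)(;)")

# \g<1> and \g<3> reinsert the context; the text NEW replaces group 2.
replaced = pattern.sub(r"\g<1>NEW\g<3>", text)

# Alternative for a single match: splice the string using the group's span.
m = pattern.search(text)
start, end = m.span(2)
spliced = text[:start] + "NEW" + text[end:]

print(replaced)
print(spliced)
```

The \g<1> form is used instead of \1 so that a replacement starting with a digit cannot be misread as part of the group number.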

python regular expression $[a-zA-Z_][0-9a-zA-Z_]* with forbidding

I am looking for a regular expression in Python that finds every occurrence of the following in a string:
$[a-zA-Z_][0-9a-zA-Z_]*
There can be several of these, and they can be separated by whitespace (\s).
This would all be quite easy, but I also need to reject the whole string if it contains anything that does not match the pattern. (An empty string is also OK.)
I'll give you some examples:
$x$y0123 => OK, gives me [$x, $y0123]
$ => BAD (only $)
"" or " \t" => OK, gives me []
$x #hi => BAD, cause #hi, does not match the pattern
It can be more regular expressions, it doesn't have to be just one.
regex = re.compile("(\$[a-zA-Z_][0-9a-zA-Z_]*)")
regex.findall(string)
This would be OK if I didn't also have to check those things.
Hmm, I'm not entirely sure what you're trying to do, but maybe you need 2 regex: the first to check the validity of the format, and the second to retrieve the matches.
import re
stuff = ["$x$y0123", "$", "", " \t", "$x #hi"]
p1 = re.compile(r'(?:\$[A-Z_]\w*|\s)*$', re.IGNORECASE)
p2 = re.compile(r'\$[A-Z_]\w*|\s+', re.IGNORECASE)
for thing in stuff:
    if p1.match(thing):
        print(p2.findall(thing))
Will print:
['$x', '$y0123']
[]
[' \t']
To check the whole string, it is better to use re.match instead of re.findall. A pattern that also allows whitespace looks like this: ^((\$[a-zA-Z_][0-9a-zA-Z_]*)|(\s))*$
Try this:
import re
s1 = '$x$y0123 $_xyz1$B0dR_'
s2 = '$x$y0123 $_xyz1$B0dR_ #1'
s3 = '$'
s4 = ' \t'
s5 = ''
def process(s, pattern):
    '''Find substrings in s that match pattern
    if string is not completely composed of substrings that match pattern
    raises AttributeError
    s --> str
    pattern --> str
    returns list
    '''
    rex = re.compile(pattern)
    matches = list()
    while s:
##        print('*'*8)
##        print(s)
        m = rex.match(s)
        matches.append(m)
##        print('\t', m.group(), m.span(), m.endpos)
        # m is None when s does not start with a match, so m.end() raises AttributeError
        s = s[m.end():]
    return matches
pattern = r'\$[a-zA-Z_][0-9a-zA-Z_]*'
for s in [s1, s2, s3, s4, s5]:
    print('*'*8)
    # remove whitespace
    s = re.sub(r'\s', '', s)
    if not s:
        print('empty string')
        continue
    try:
        matches = process(s, pattern)
    except AttributeError:
        print('this string has bad stuff in it')
        print(s)
        continue
    print('\n'.join(m.group() for m in matches))
>>>
********
$x
$y0123
$_xyz1
$B0dR_
********
this string has bad stuff in it
$x$y0123$_xyz1$B0dR_#1
********
this string has bad stuff in it
$
********
empty string
********
empty string
>>>
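The validate-then-extract idea from the answers above can also be sketched with re.fullmatch in Python 3. The names NAME, VALID, FIND and extract_names are assumptions for illustration, and returning None for an invalid string is one possible convention, not something the question prescribes.

```python
import re

NAME = r"\$[A-Za-z_][0-9A-Za-z_]*"

# Whole-string check: any number of names, each optionally preceded by
# whitespace, followed by optional trailing whitespace.
VALID = re.compile(r"(?:\s*" + NAME + r")*\s*")
FIND = re.compile(NAME)

def extract_names(text):
    """Return the list of $names in text, or None if anything else is present."""
    if VALID.fullmatch(text) is None:
        return None
    return FIND.findall(text)

for sample in ["$x$y0123", "$", "", " \t", "$x #hi"]:
    print(repr(sample), "->", extract_names(sample))
```

Unlike the first answer, whitespace is not included in the result, so " \t" yields an empty list as the question requests.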
