I'm trying to extract out the number before the - and the rest of the string after it, but it's not able to extract out both. Here's the output from the interactive terminal:
>>> a = '#232 - Hello There'
>>> re.findall('#(.*?) - (.*?)', a)
[('232', '')]
Why is my regex not working properly?
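.*? is non-greedy, i.e. it matches the shortest possible substring, which here is the empty string because nothing after the group forces it to grow. You need the greedy version, .* (matches the longest possible substring), for the latter group: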
In [1143]: a = '#232 - Hello There'
In [1144]: re.findall('#(.*?) - (.*?)', a)
Out[1144]: [('232', '')]
In [1145]: re.findall('#(.*?) - (.*)', a)
Out[1145]: [('232', 'Hello There')]
But you should use str methods for such simple cases, e.g. str.split, splitting on ' - ':
In [1146]: a.split(' - ')
Out[1146]: ['#232', 'Hello There']
Or str.partition on ' - ' with slicing:
In [1147]: a.partition(' - ')[::2]
Out[1147]: ('#232', 'Hello There')
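One detail that may matter when choosing between the two: str.partition always returns a 3-tuple and leaves the separator slot empty when the separator is missing, whereas unpacking the result of str.split into two names fails in that case. A minimal sketch, where split_label is a hypothetical helper name used only for illustration:

```python
def split_label(text, sep=' - '):
    """Split 'head - tail' into (head, tail); hypothetical helper for illustration."""
    head, found, tail = text.partition(sep)
    if not found:
        # Separator missing: partition returned (text, '', '')
        return text, None
    return head, tail


print(split_label('#232 - Hello There'))  # ('#232', 'Hello There')
print(split_label('no separator here'))   # ('no separator here', None)
```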
This expression should extract the desired values:
([0-9]+)\s*-\s*(.*)
Test
import re
print(re.findall(r"([0-9]+)\s*-\s*(.*)", "#232 - Hello There"))
Output
[('232', 'Hello There')]
Your regex is fine, you're just using the wrong function from re. The following matches things correctly:
m = re.fullmatch('#(.*?) - (.*?)', a)
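A short sketch of why this works: re.fullmatch requires the pattern to cover the whole string, so the trailing lazy group has to expand to the end instead of stopping at the empty match. Note that it returns a match object (or None), not a list, so the values are read with groups().

```python
import re

a = '#232 - Hello There'

# findall may stop as early as possible, so the last lazy group stays empty
print(re.findall(r'#(.*?) - (.*?)', a))   # [('232', '')]

# fullmatch must consume the entire string, forcing the lazy group to grow
m = re.fullmatch(r'#(.*?) - (.*?)', a)
if m:
    print(m.groups())                      # ('232', 'Hello There')

# An explicit end anchor gives findall the same behaviour
print(re.findall(r'#(.*?) - (.*?)$', a))  # [('232', 'Hello There')]
```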
I am trying to extract the ticket number from an email reply subject message. The subject message typically looks like this:
s = 'Re: Test something before TICKET#ABC123 hello world something after'
I would like to extract the part TICKET#ABC123
How can I achieve this the best in Python? Is this the way to go for my purpose or do you have better suggestions to keep track of mail chains?
Without regex (using split() and startswith()):
s = 'Re: Test something before TICKET#ABC123 hello world something after'
splitted = s.split()
for x in splitted:
    if x.startswith('TICKET#'):
        print(x)
# TICKET#ABC123
You could use the following regex:
import re
s = 'Re: Test something before TICKET#ABC123 hello world something after'
re.findall(r'TICKET#[a-zA-Z0-9]+(?=\s)', s)
# ['TICKET#ABC123']
Explanation:
TICKET# - matches the characters TICKET# literally (case sensitive); the leading r' only marks a Python raw string
[a-zA-Z0-9] - Match a single character present in [a-zA-Z0-9]
+ - Quantifier Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=\s) - Positive Lookahead (?=\s)
\s- matches any whitespace character (equal to [\r\n\t\f\v ])
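One caveat with the lookahead, sketched below: (?=\s) requires a whitespace character after the ticket, so a ticket at the very end of the subject is not matched. A word boundary \b is one possible alternative; this is a suggestion for illustration, not part of the answer above, and the shortened subject strings are made up for the example.

```python
import re

pattern_lookahead = re.compile(r'TICKET#[a-zA-Z0-9]+(?=\s)')
pattern_boundary = re.compile(r'TICKET#[a-zA-Z0-9]+\b')

middle = 'Re: Test something before TICKET#ABC123 hello world'
at_end = 'Re: Test something before TICKET#ABC123'

print(pattern_lookahead.findall(middle))  # ['TICKET#ABC123']
print(pattern_lookahead.findall(at_end))  # [] because no whitespace follows
print(pattern_boundary.findall(middle))   # ['TICKET#ABC123']
print(pattern_boundary.findall(at_end))   # ['TICKET#ABC123']
```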
Using Regex.
Ex:
import re
s = 'Re: Test something before TICKET#ABC123 hello world something after'
m = re.search(r"TICKET#(\w+)", s)
if m:
    print(m.group(1))
Output:
ABC123
I can't comment on @Rakesh's answer, but the regex needs a small change, since the expected result is TICKET#ABC123.
Ex:
import re
s = 'Re: Test something before TICKET#ABC123 hello world something after'
m = re.search(r"(TICKET#(\w+))", s)
if m:
    print(m.group(1))
Output:
TICKET#ABC123
If you want to get the ticket number, then you can use
m.group(2)
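As a sketch of an alternative that avoids the extra group: group(0) (or group() with no argument) always holds the whole match, so the pattern with a single capturing group already gives access to both values.

```python
import re

s = 'Re: Test something before TICKET#ABC123 hello world something after'
m = re.search(r"TICKET#(\w+)", s)
if m:
    print(m.group(0))  # TICKET#ABC123 (the whole match)
    print(m.group(1))  # ABC123 (the first capturing group)
```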
I would like to clean some input that was logged from my keyboard with python and regex.
Especially when backspace was used to fix a mistake.
Example 1:
[in]: 'Helloo<BckSp> world'
[out]: 'Hello world'
This can be done with
re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')
Example 2:
However when I have several backspaces, I don't know how to delete exactly the same number of characters before:
[in]: 'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'
(Here I want to remove 'l' and 'o' before the two backspaces).
I could simply use re.sub(r'[^>]<BckSp>', '', line) several times until there is no <BckSp> left but I would like to find a more elegant / faster solution.
Does anyone know how to do this ?
It looks like Python does not support recursive regex. If you can use another language, you could try this:
.(?R)?<BckSp>
See: https://regex101.com/r/OirPNn/1
It isn't very efficient but you can do that with the re module:
(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1
This way you don't have to count, the pattern only uses the repetition.
(?:
    [^<]  # a character to remove
    (?=   # lookahead to reach the corresponding <BckSp>
        [^<]*  # skip characters until the first <BckSp>
        (      # capture group 1: contains the <BckSp>s
            (?=(\1?))\2  # emulate an atomic group in place of \1?+
                         # The idea is to add the <BckSp>s already matched in the
                         # previous repetitions, if any, to be sure that the following
                         # <BckSp> isn't already associated with a character
            <BckSp>      # corresponding <BckSp>
        )
    )
)+  # each time the group is repeated, capture group 1 grows by one more <BckSp>
\1  # matches all the consecutive <BckSp>s and ensures that no other character
    # remains between the last character to remove and the first <BckSp>
You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:
(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1
But with the regex module, you can also use recursion (as @Fallenhero noticed):
[^<](?R)?<BckSp>
Since Python re has no support for recursion/subroutine calls (and, before Python 3.11, no atomic groups or possessive quantifiers either), you can remove the characters followed by backspaces in a loop:
import re
s = "Helllo\b\bo world"
r = re.compile("^\b+|[^\b]\b")
while r.search(s):
    s = r.sub("", s)
print(s)
The "^\b+|[^\b]\b" pattern will find 1+ backspace chars at the string start (with ^\b+) and [^\b]\b will find all non-overlapping occurrences of any char other than a backspace followed with a backspace.
The same approach works when a backspace is expressed as some entity/tag, such as a literal <BckSp>:
import re
s = "Helllo<BckSp><BckSp>o world"
r = re.compile("^(?:<BckSp>)+|.<BckSp>", flags=re.S)
while r.search(s):
    s = r.sub("", s)
print(s)
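A possible variation on the loop, as a sketch: re.subn returns the number of substitutions together with the new string, so the separate r.search call on every pass can be dropped. strip_backspaces_regex is a hypothetical helper name, and the marker is escaped so that other marker strings could be passed in.

```python
import re


def strip_backspaces_regex(text, marker='<BckSp>'):
    """Hypothetical helper: delete 'char + marker' pairs until no marker is left."""
    m = re.escape(marker)
    # Leading markers have nothing to delete; otherwise drop one char per marker
    pattern = re.compile('^(?:' + m + ')+|.' + m, flags=re.S)
    while True:
        text, count = pattern.subn('', text)
        if count == 0:  # nothing replaced on this pass, so we are done
            return text


print(strip_backspaces_regex('Helllo<BckSp><BckSp>o world'))  # Hello world
print(strip_backspaces_regex('<BckSp>Helloo<BckSp> world'))   # Hello world
```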
Slightly verbose, but you can use a lambda function to count the number of <BckSp> occurrences and slice that many characters off the preceding text to get the final output (Python 3 syntax; note the integer division //).
>>> import re
>>> bk = '<BckSp>'
>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2)) // len(bk)], s))
Hello world
>>> s = 'Helloo<BckSp> world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2)) // len(bk)], s))
Hello world
>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2)) // len(bk)], s))
Hello word
>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2)) // len(bk)], s))
Hello work
If the marker is a single character, you can simply use a stack, which gives you the result in a single pass:
s = "Helllo\b\bo world"
res = []
for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)
print(''.join(res)) # Hello world
If the marker is literally '<BckSp>' or some other string longer than one character, you can use replace to turn it into '\b' and then apply the solution above. This only works if you know that '\b' doesn't occur in the input. If you can't designate a substitute character, you can use split and process the parts:
s = 'Helllo<BckSp><BckSp>o world'
res = []
for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)
print(''.join(res)) # Hello world
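The substitution idea described above can be sketched as follows, assuming '\b' never occurs in the raw input. replay_backspaces is a hypothetical name chosen for this example.

```python
def replay_backspaces(s, marker='<BckSp>'):
    """Hypothetical helper: map each marker to a backspace char, then replay on a stack."""
    res = []
    # Assumption: the input contains no literal backspace characters of its own
    for c in s.replace(marker, '\b'):
        if c == '\b':
            if res:  # ignore a backspace when there is nothing left to delete
                res.pop()
        else:
            res.append(c)
    return ''.join(res)


print(replay_backspaces('Helllo<BckSp><BckSp>o world'))  # Hello world
print(replay_backspaces('<BckSp>Helloo<BckSp> world'))   # Hello world
```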
I'm very new to regex, and I'm trying to find instances in a string where there is a word consisting of either the letter w or e followed by 2 digits, such as e77, w10, etc.
Here's the regex that I currently have, which I think finds that (correct me if I'm wrong):
([e|w])\d{0,2}(\.\d{1,2})?
How can I add a space right after the letter e or w? If there are no instances where the criteria is met, I would like to keep the string as is. Do I need to use re.sub? I've read a bit about that.
Input: hello e77 world
Desired output: hello e 77 world
Thank You.
Your regex needs to just look like this:
([ew])(\d{2})
if you want to only match specifically 2 digits, or
([ew])(\d{1,2})
if you also want to match single digits like e4
The parentheses are called capturing groups; they can be back-referenced in a search and replace or, in Python, with re.sub.
Your replacement string should look like
\1 \2
So it should be as simple as a line like:
re.sub(r'([ew])(\d{1,2})', r'\1 \2', your_string)
EDIT: working code
>>> import re
>>> your_string = 'hello e77 world'
>>>
>>> re.sub(r'([ew])(\d{1,2})', r'\1 \2', your_string)
'hello e 77 world'
This is what you're after:
import re
print(re.sub(r'([ew])(\d{1,2})', r'\g<1> \g<2>', 'hello e77 world'))
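A small sketch of two points from the question: re.sub returns the string unchanged when nothing matches, and adding \b on both sides (an assumption about what "a word" should mean here, not part of the answers above) keeps the pattern from firing inside longer words. space_after_prefix is a hypothetical helper name.

```python
import re


def space_after_prefix(text):
    # \b on both sides restricts the match to standalone words such as e77 or w10
    return re.sub(r'\b([ew])(\d{2})\b', r'\1 \2', text)


print(space_after_prefix('hello e77 world'))    # hello e 77 world
print(space_after_prefix('new12 w10 e777'))     # new12 w 10 e777
print(space_after_prefix('nothing to change'))  # nothing to change
```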
I have a string in python 2.7:
a = 'This is some text ( - 1) - And some more text (0:0) - Some more Text'
I would like to use a regex to get the ' - 1' out of this string.
I've tried but can't find it, thanks for your help, I've tried:
re.search(r'.*?\((.*)\).*', a)
But that didn't work. Mind you there's a second ( ) in the string but I only need the first one.
THANK YOU!
Regexes are greedy by default. Your expression captures everything from the first ( to the last ).
You wrote:
re.search(r'.*?\((.*)\).*', a)
Instead, use
re.search(r'.*?\((.*?)\).*', a)
note the non-greedy version of the match .*? (I just added a question mark to your regex to make it work)
Variant: avoid closing parenthesis in your group capture
re.search(r'.*?\(([^)]*)\).*', a)
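The three behaviours can be compared side by side. This is a minimal sketch; it drops the leading .*? and trailing .* because re.search already scans for the first position where the pattern matches.

```python
import re

a = 'This is some text ( - 1) - And some more text (0:0) - Some more Text'

greedy = re.search(r'\((.*)\)', a).group(1)      # first '(' up to the LAST ')'
lazy = re.search(r'\((.*?)\)', a).group(1)       # first '(' up to the NEXT ')'
negated = re.search(r'\(([^)]*)\)', a).group(1)  # same result, no backtracking over ')'

print(repr(greedy))
print(repr(lazy))     # ' - 1'
print(repr(negated))  # ' - 1'
```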
This would do it:
out = re.match(r'.*?\((.*?)\)', a).groups()[0]
if you wanted to remove from the original string,
a.replace(out, '')
The code below returns all matches in the string:
>>> import re
>>> mystring = 'This is some text ( - 1) - And some more text (0:0) - Some more Text'
>>> regex = re.compile(r".*?\((.*?)\)")
>>> result = re.findall(regex, mystring)
>>> result
[' - 1', '0:0']
To get the first match (or whichever one you want), access it by the corresponding index:
>>> result[0]
' - 1'
You can use re.sub to replace the first match found in the string, by using the optional count argument.
>>> a = 'This is some text ( - 1) - And some more text (0:0) - Some more Text'
>>> re.sub(r'\(.+?\)', '', a, count=1)
What about:
tmp_1 = mystring.split('(')
tmp_2 = tmp_1[1].split(')')
result = tmp_2[0]
This process is simpler, and can be generalized if the brackets are nested.
I would like to get 2 captured groups for a pair of consecutive words. I use this regular expression:
r'\b(hello)\b(world)\b'
However, searching "hello world" with this regular expression yields no results:
regex = re.compile(r'\b(hello)\b(world)\b')
m = regex.match('hello world') # m evaluates to None.
You need to allow for space between the words:
>>> import re
>>> regex = re.compile(r'\b(hello)\s*\b(world)\b')
>>> regex.match('hello world')
<_sre.SRE_Match object at 0x7f6fcc249140>
>>>
Discussion
The regex \b(hello)\b(world)\b requires that the word hello end exactly where the word world begins, but with a word boundary \b between them. That cannot happen, because there is no boundary between two adjacent word characters. Allowing whitespace, \s*, between them fixes this.
If you meant to allow punctuation or other separators between hello and world, then that possibility should be added to the regex.
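One way to allow such separators, as a sketch: \W+ accepts one or more non-word characters (spaces, commas, dashes and so on) between the two words. Whether that is the right set of separators is an assumption; adjust the class to the inputs you expect.

```python
import re

regex = re.compile(r'\b(hello)\W+(world)\b')

for text in ('hello world', 'hello, world', 'hello - world', 'helloworld'):
    m = regex.match(text)
    # Prints the two captured groups, or None when the words are not separated
    print(repr(text), '->', m.groups() if m else None)
```

The first three inputs yield ('hello', 'world'); the last one yields None because at least one separator character is required.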