Python regex: finding sequence of chars inside a string - python

I'm using python and regex (new to both) to find sequence of chars in a string as follows:
Grab the first instance of p followed by any number (It'll always be in the form of p_ _ where _ and _ will be integers). Then either find an 's' or a 'go' then all integers till the end of the string. For example:
ascjksdcvyp12nbvnzxcmgonbmbh12hjg23
should yield p12 go 12 23.
ascjksdcvyp12nbvnzxcmsnbmbh12hjg23
should yield p12 s 12 23.
I've only managed to get the p12 part of the string and this is what I've tried so far to extract the 'go' or 's':
decoded = re.findall(r'([p][0-9]*)', myStr)
print(decoded)  # prints ['p12']
I know by doing something like
re.findall(r'[s]|[go]',myStr)
will give me all occurrences of s and g and o, but something like that is not what I'm looking for. And I'm not sure how I'd combine these regexes to get the desired output.

Use re.findall with pattern grouping:
>>> string = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 'go', '12', '23')]
>>> string = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 's', '12', '23')]
With re.findall, only the parts matched by the capturing groups () are returned
p\d{2} matches p followed by exactly two digits
After that .* matches anything
Then, s|go matches either s or go
\D* matches any number of non-digits
\d+ indicates one or more digits
(?:) is a non-capturing group i.e. the match inside won't show up in the output, it is only for the sake of grouping tokens
Note:
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+?', string)
[('p12', 's', '12')]
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', string)
[('p12', 's', '23')]
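I would have preferred one of these two shorter patterns, since matching the later digits is a repeated task. Both have a problem, though: a repeated capturing group keeps only its last iteration, so the non-greedy version stops after the first number and the greedy version keeps only the last one. Hence the digits after s or go have to be matched explicitly.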
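One possible way around the repeated-group limitation, sketched here under the assumption that any count of trailing numbers should be collected: match the p12 and s/go parts with re.search, then run re.findall over the rest of the string. The helper name decode is hypothetical, and the lazy .*? picks the first s or go after p12 rather than the last one.

```python
import re

def decode(text):
    """Return ['pNN', 's' or 'go', number, number, ...] or None (hypothetical helper)."""
    # First p followed by two digits, then the nearest s or go after it.
    head = re.search(r'(p\d{2}).*?(s|go)', text)
    if head is None:
        return None
    # Every run of digits in whatever follows the s/go marker.
    numbers = re.findall(r'\d+', text[head.end():])
    return [head.group(1), head.group(2)] + numbers

print(decode('ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'))  # ['p12', 'go', '12', '23']
print(decode('ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'))   # ['p12', 's', '12', '23']
```

Splitting the work in two avoids having to spell out one group per expected number.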

First, try to match your line with a minimal pattern, as a test. Use (capturing) and (?:non-capturing) parentheses to capture the interesting parts and skip the uninteresting ones. Store away what you care about, then chop off the remainder of the string and search it for numbers as a second step.
import re
line = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'  # sample input from the question
simple_test = r'^.*p(\d{2}).*?(?:s|go).*?(\d+)'
m = re.match(simple_test, line)
if m is not None:
    p_num = m.group(1)
    trailing_numbers = [m.group(2)]
    remainder = line[m.end():]  # everything after the first trailing number
    trailing_numbers.extend(  # extend list by appending
        map(  # list from applying
            lambda m: m.group(1),  # get group(1) from match
            re.finditer(r"(\d+)", remainder)  # of each number in string
        )
    )
    print("P:", p_num, "Numbers:", trailing_numbers)  # P: 12 Numbers: ['12', '23']

Related

Parse string to get digits before and after particular character

I am trying to parse digits before and after X from this string, but unable to get all the digits. Can someone help me pointing out what I am missing here?
>>> import re
>>> f = "abc_xyz1024X137M4B4abc_xyz"
>>> re.findall(".*\w+(\d+)X(\d+).*", f)
[('4', '137')]
Note that .*\w+(\d+)X(\d+).* first grabs as many characters as possible with .* (the whole string) and then backtracks, trying to match the subsequent patterns. It gives back only as much as it has to: \w+ and (\d+) each need just one character before X, so the first capturing group contains only the single digit right before X, while the second one contains all the digits after X.
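To make the backtracking visible, the sketch below wraps .* and \w+ in extra capturing groups. Those two groups are not part of the question's pattern; they are added here only to show which part of the string each piece ends up with.

```python
import re

f = "abc_xyz1024X137M4B4abc_xyz"

# Question's pattern, with .* and \w+ captured as well (illustration only).
m = re.match(r"(.*)(\w+)(\d+)X(\d+).*", f)
print(m.groups())  # ('abc_xyz10', '2', '4', '137')
```

The greedy .* keeps everything up to "10", \w+ is left with "2", and the digit group gets only "4".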
You should only match and capture the digits, then match the X and then again match and capture the digits.
You may use
import re
f = "abc_xyz1024X137M4B4abc_xyz"
print(re.findall(r"(\d+)X(\d+)", f))
# => [('1024', '137')]
Or, if you are only interested in a single match:
m = re.search(r"(?P<x>\d+)X(?P<y>\d+)", f)
if m:
    print(m.groupdict())  # => {'x': '1024', 'y': '137'}
In this particular example, another option is to split the string on the character "X". Then find the last set of consecutive digits in the left half of the split and the first set of consecutive digits in the right half of the split.
For example:
import re
f = "abc_xyz1024X137M4B4abc_xyz"
left, right = f.split("X")
print(right)
#137M4B4abc_xyz
print(left)
#abc_xyz1024
print((re.findall(r'\d+', left)[-1], re.findall(r'\d+', right)[0]))
#('1024', '137')
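The split approach raises an error when the string contains no "X" or more than one, or when one half has no digits. A more defensive sketch, assuming that returning None is acceptable in those cases, could use str.partition, which always returns three parts. The function name digits_around is hypothetical.

```python
import re

def digits_around(text, sep="X"):
    """Last digit run before the first sep and first digit run after it, or None."""
    left, found, right = text.partition(sep)  # found is '' when sep is missing
    left_runs = re.findall(r"\d+", left)
    right_runs = re.findall(r"\d+", right)
    if not found or not left_runs or not right_runs:
        return None
    return left_runs[-1], right_runs[0]

print(digits_around("abc_xyz1024X137M4B4abc_xyz"))  # ('1024', '137')
print(digits_around("no separator here"))           # None
```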

Regular expression matching all but a string

I need to find all the strings matching a pattern with the exception of two given strings.
For example, find all groups of letters with the exception of aa and bb. Starting from this string:
-a-bc-aa-def-bb-ghij-
Should return:
('a', 'bc', 'def', 'ghij')
I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)
I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:
>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)
I tried with negative look ahead, but I couldn't find a working solution for this case.
You can make use of negative look aheads.
For example,
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']
- matches a literal hyphen
(?!aa|bb) is a negative lookahead; it checks that the hyphen is not followed by aa or bb
([^-]+) matches one or more characters other than -
Edit
The above regex also skips fields that merely start with aa or bb, such as -aabc-. To take care of that, we can add - to the lookahead alternatives:
>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)
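The difference between the two lookaheads only shows up on input that has fields beginning with aa or bb. The sample string below is made up for this comparison and is not from the question.

```python
import re

s = "-a-aabc-aa-bb-bbq-"  # hypothetical input with fields that start with aa / bb

# Rejects every field that starts with aa or bb.
loose = re.findall(r'-(?!aa|bb)([^-]+)', s)
# Rejects only the fields that are exactly aa or bb.
strict = re.findall(r'-(?!aa-|bb-)([^-]+)', s)

print(loose)   # ['a']
print(strict)  # ['a', 'aabc', 'bbq']
```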
You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.
Use
res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)
or - if your values in between hyphens can be any but a hyphen, use a negated character class [^-]:
res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)
Details:
- - a hyphen
(?!(?:aa|bb)-) - if there is aa- or bb- after the first hyphen, no match should be returned
(\w+) - Group 1 (this value will be returned by the re.findall call) capturing 1 or more word chars OR [^-]+ - 1 or more characters other than -
(?=-) - there must be a - after the word chars. A lookahead is used here because it does not consume the hyphen, so the same hyphen can serve as the starting point of the next match.
Python demo:
import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']
Although a regex solution was asked for, I would argue that this problem can be solved more easily with plain Python string functions, namely splitting and filtering:
input_list = "-a-bc-aa-def-bb-ghij-"
exclude = set(["aa", "bb"])
result = [s for s in input_list.split('-')[1:-1] if s not in exclude]
This solution has the additional advantage that result could also be turned into a generator and the result list does not need to be constructed explicitly.
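The generator variant mentioned above might look like the following sketch. The function name kept_fields is hypothetical. Note that str.split still builds the intermediate list of fields; only the filtered result is produced lazily.

```python
def kept_fields(text, exclude=frozenset({"aa", "bb"})):
    """Lazily yield the hyphen-separated fields that are not in exclude."""
    # [1:-1] drops the empty strings produced by the leading and trailing hyphen.
    return (field for field in text.split('-')[1:-1] if field not in exclude)

for field in kept_fields("-a-bc-aa-def-bb-ghij-"):
    print(field)  # a, bc, def, ghij (one per line)
```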

Regular Expressions using Substitution to convert numbers

I'm a Python beginner, so keep in mind my regex skills are level -122.
I need to convert a string with text containing file1 to file01, but not convert file10 to file010.
My program is wrong, but this is the closest I can get, I've tried dozens of combinations but I can't get close:
import re
txt = 'file8, file9, file10'
pat = r"[0-9]"
regexp = re.compile(pat)
print(regexp.sub(r"0\d", txt))
Can someone tell me what's wrong with my pattern and substitution and give me some suggestions?
You could capture the number and check the length before adding 0, but you might be able to use this instead:
import re
txt = 'file8, file9, file10'
pat = r"(?<!\d)(\d)(?=,|$)"
regexp = re.compile(pat)
print(regexp.sub(r"0\1", txt))
(?<! ... ) is called a negative lookbehind. It rejects a match at a position if the pattern inside the lookbehind matches the text immediately before that position. For example, (?<!a)b matches every b in a string except one that has an a right before it, so the b's in bb and cb match, but the b in ab doesn't. (?<!\d)(\d) thus matches a digit, unless it has another digit before it.
(\d) is a single digit enclosed in a capture group, denoted by plain parentheses. The digit is stored in the first capture group, which the replacement refers to as \1.
(?= ... ) is a positive lookahead. It allows a match only if the pattern inside the lookahead matches the text immediately after what has been matched so far. In other words, a(?=b) matches an a in a string only if there's a b right after it. The a in ab matches, but the ones in ac or aa don't.
(?=,|$) is a positive lookahead containing ,|$ meaning either a comma, or the end of the string.
(?<!\d)(\d)(?=,|$) thus matches any digit, as long as there's no digit before it and there's a comma after it, or if that digit is at the end of the string.
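The small examples from the explanation can be checked directly. The sketch below lists the match positions for the two toy patterns and then applies the full pattern. The last input, file8.txt, is an assumed extra case that shows a limit of the (?=,|$) lookahead: a digit followed by anything other than a comma or the end of the string is left alone.

```python
import re

# (?<!a)b: a b that is not preceded by an a.
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!a)b", "bb cb ab")]
# a(?=b): an a that is followed by a b.
lookahead_hits = [m.start() for m in re.finditer(r"a(?=b)", "ab ac aa")]
print(lookbehind_hits)  # [0, 1, 4]
print(lookahead_hits)   # [0]

pad = re.compile(r"(?<!\d)(\d)(?=,|$)")
padded = pad.sub(r"0\1", "file8, file9, file10")
untouched = pad.sub(r"0\1", "file8.txt")
print(padded)     # file08, file09, file10
print(untouched)  # file8.txt
```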
How about this?
a='file1'
a='file' + "%02d" % int(a.split('file')[1])
This approach uses a regex to find every sequence of digits and str.zfill to pad with zeros:
>>> txt = 'file8, file9, file10'
>>> re.sub(r'\d+', lambda m : m.group().zfill(2), txt)
'file08, file09, file10'
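A bare \d+ pads every number in the text, not only the ones that belong to a file name. If that is too broad, one option is a lookbehind on the literal prefix. This is a sketch under the assumption that the numbers always directly follow the word file. The helper name pad_file_numbers and its width parameter are hypothetical.

```python
import re

def pad_file_numbers(text, width=2):
    """Zero-pad only the numbers that directly follow the literal 'file'."""
    return re.sub(r"(?<=file)\d+", lambda m: m.group().zfill(width), text)

print(pad_file_numbers("file8, file9, file10"))  # file08, file09, file10
print(pad_file_numbers("file8, v2, file10"))     # file08, v2, file10
```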

Exclude matched string python re.findall

I am using Python's re.findall method to find occurrences of certain string values in an input string.
E.g. when searching the string 'ABCdef', I have two search requirements:
1. Find a string starting with a single capital letter.
2. After 1, find a string that consists entirely of capital letters.
e.g. input string and expected output will be:
'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']
So My expectation from
>>> matched_value_list = re.findall ( '[A-Z][a-z]+|[A-Z]+' , 'ABCdef' )
is to return ['AB', 'Cdef'].
But that does not seem to happen. What I get is ['ABC']: the latter part of the regex matches the whole run of capital letters.
So is there any way to take already-found matches into account, so that once 'Cdef' is matched by '[A-Z][a-z]+', the second part of the regex (i.e. '[A-Z]+') only matches the remaining string 'AB'?
First you need to match AB, which is followed by an uppercase letter and then a lowercase letter, or is at the end of the string. For that you can use a lookahead.
Then you need to match an uppercase letter C, followed by multiple lowercase letters def.
So, you can use this pattern:
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
As pointed out in a comment by @sotapme, you can also modify the above regex to:
"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"
\d+ is added since you also want to match digits, as in one of your examples. The comment also removes the [a-z] part from the first alternative of the lookahead. That works because the + quantifier on the outer [A-Z] is greedy by default: it takes the whole run of capitals and backtracks by a single letter only when the run is not at the end of the string, stopping right before the last uppercase letter.
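As a quick check, the modified pattern can be run against the inputs listed in the question. In this sketch the outer capturing parentheses are dropped, which does not change what re.findall returns here because they wrapped the whole pattern.

```python
import re

pattern = re.compile(r"[A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+")

samples = ["USA", "BObama", "Institute20CSE", "ABCdef"]
results = {text: pattern.findall(text) for text in samples}
for text, parts in results.items():
    print(text, parts)
# USA ['USA']
# BObama ['B', 'Obama']
# Institute20CSE ['Institute', '20', 'CSE']
# ABCdef ['AB', 'Cdef']
```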
You can use this regex
[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)

Split a string using an integer as a delimiter

I have a rather long txt file filled with strings of the format {letter}{number}{letter}. For instance, the first few lines of my file are:
A123E
G234W
R3L
H4562T
I am having difficulty finding the correct regex pattern to separate each line by alpha and numeric.
For instance, in the first line, I would like an array with the results:
print(first_line[0])  # A
print(first_line[1])  # 123
print(first_line[2])  # E
It seems like regex would be the way to go, but I'm still a regex novice. Could someone help point me in the correct direction on how to do this?
Then I plan to iterate over each of the lines and use the info as necessary.
Split on \d+:
import re
re.split(r'(\d+)', line)
\d is the character class matching the digits 0 through to 9, and we want to match at least 1 of them. By putting a capturing group around the \d+, re.split() will include the match in the output:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Demo:
>>> import re
>>> re.split(r'(\d+)', 'A123E')
['A', '123', 'E']
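Applied to every line, the same call could be used as in the sketch below. The list lines stands in for the contents of the text file, which is an assumption made to keep the example self-contained. The last call shows a side effect worth knowing: if a line starts with digits, re.split returns an empty string as the first element.

```python
import re

# Stand-in for the lines read from the file (sample values from the question).
lines = ["A123E", "G234W", "R3L", "H4562T"]

parsed = [re.split(r"(\d+)", line) for line in lines]
for parts in parsed:
    print(parts)
# ['A', '123', 'E']
# ['G', '234', 'W']
# ['R', '3', 'L']
# ['H', '4562', 'T']

leading_digits = re.split(r"(\d+)", "123E")
print(leading_digits)  # ['', '123', 'E']
```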
