I am trying to parse the digits before and after X from this string, but I am unable to get all of them. Can someone help me by pointing out what I am missing here?
>>> import re
>>> f = "abc_xyz1024X137M4B4abc_xyz"
>>> re.findall(".*\w+(\d+)X(\d+).*", f)
[('4', '137')]
Note that .*\w+(\d+)X(\d+).* first grabs zero or more chars, as many as possible (the whole string), and then backtracks, trying to match the subsequent patterns. Because \w also matches digits, the engine gives back characters only until the next char is a digit right before X, so the first capturing group contains only the single digit before X, while the second one contains all the digits after X.
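To see the backtracking concretely, here is a minimal sketch that uses only the string and patterns from this thread and prints what the first group captured and where. The variable names greedy and fixed are just for illustration.

```python
import re

f = "abc_xyz1024X137M4B4abc_xyz"

# Original pattern: .* and \w+ (which also matches digits) together consume
# everything up to the last digit before X, leaving one digit for group 1.
greedy = re.search(r".*\w+(\d+)X(\d+).*", f)
print(greedy.group(1), greedy.span(1))  # 4 (10, 11)

# Anchoring the match on the digits themselves captures the whole number.
fixed = re.search(r"(\d+)X(\d+)", f)
print(fixed.group(1), fixed.span(1))  # 1024 (7, 11)
```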
You should only match and capture the digits, then match the X and then again match and capture the digits.
You may use
import re
f = "abc_xyz1024X137M4B4abc_xyz"
print(re.findall(r"(\d+)X(\d+)", f))
# => [('1024', '137')]
Or, if you are only interested in a single match:
m = re.search(r"(?P<x>\d+)X(?P<y>\d+)", f)
if m:
    print(m.groupdict()) # => {'x': '1024', 'y': '137'}
In this particular example, another option is to split the string on the character "X". Then find the last set of consecutive digits in the left half of the split and the first set of consecutive digits in the right half of the split.
For example:
import re
f = "abc_xyz1024X137M4B4abc_xyz"
left, right = f.split("X")
print(right)
#137M4B4abc_xyz
print(left)
#abc_xyz1024
print((re.findall(r'\d+', left)[-1], re.findall(r'\d+', right)[0]))
#('1024', '137')
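One caveat: left, right = f.split("X") unpacks cleanly only when the string contains exactly one X. A minimal sketch that tolerates several X characters, assuming the first X is the one of interest and that the digits must touch it (as in the regex answer), could use str.partition. The helper name digits_around_x is made up for this illustration.

```python
import re

def digits_around_x(s):
    """Return (digits right before the first X, digits right after it), or None."""
    left, sep, right = s.partition("X")   # splits on the first X only
    before = re.search(r"\d+$", left)     # digit run that ends right at X
    after = re.match(r"\d+", right)       # digit run that starts right after X
    if not sep or before is None or after is None:
        return None
    return before.group(), after.group()

print(digits_around_x("abc_xyz1024X137M4B4abc_xyz"))  # ('1024', '137')
print(digits_around_x("abc1024X137X9"))               # ('1024', '137')
```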
From a list of strings ('16','160','1,2','100,11','1','16:','16:00'), I want to keep only the elements that
either have a comma between two digits (e.g. 1,2 or 100,11)
or have two digits (without comma) that are NOT followed by ":" (i.e. followed by nothing: e.g 16, or followed by anything but ":": e.g. 160)
I tried the following code using regex in Python:
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:[\d],[\d]|[\d][\d][^:]*)')
rate = list(filter(pattern_rate.search,string))
print(rate)
Print:
['16', '160', '1,2', '100,11', '16:', '16:00']
To be correct, the script should keep the first three items and reject the rest, but my script fails at rejecting the last two items. I guess I'm using the "[^:]" sign incorrectly.
You can match either 2 or more digits, or match 2 digits with a comma in between.
As the list contains only numbers, you could use re.match to start the match at the beginning of the string instead of re.search.
(?:\d{2,}|\d,\d)\Z
Explanation
(?: Non capture group
\d{2,} Match 2 or more digits
| Or
\d,\d Match 2 digits with a comma in between
) Close non capture group
\Z End of string
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:\d{2,}|\d,\d)\Z')
rate = list(filter(pattern_rate.match,string))
print(rate)
Output
['16', '160', '1,2']
I recommend looking a bit deeper into a regex guide.
100 is not a single digit, so \d alone will not match it. Also, a character class [..] with only one element inside is unnecessary unless you intend to negate or otherwise extend it.
The first query can be represented by (?:\d+,\d+). It's a non-capturing group that matches comma-separated numbers, each at least one digit long.
Your second alternative matches two consecutive digits followed by any number (*) of non-colon characters. Since * also allows zero characters, it still matches the 16 in 16: and 16:00.
You'll want to use something similar to (?:\d{2,}(?!:)). It's a non-capturing group matching two or more digits that are not followed by a colon. (?!...) designates a negative lookahead.
In your Python code, you'll want to use pattern_rate.match instead of pattern_rate.search: search accepts a match anywhere in the string (for example the 00 in 16:00), while match only accepts one that starts at the beginning of the string.
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')
rate = list(filter(pattern_rate.match, string))
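As a quick check of the pattern above (a sketch, reusing the list from the question): it also keeps '100,11', because \d+,\d+ allows more than one digit on each side of the comma. That follows the written rule in the question rather than its expected three-item output. The second result shows what changes when search is used instead of match.

```python
import re

string = ['16', '160', '1,2', '100,11', '1', '16:', '16:00']
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')

# match only tries the pattern at the start of each string.
with_match = list(filter(pattern_rate.match, string))
# search tries every position, so it finds the "00" in "16:00".
with_search = list(filter(pattern_rate.search, string))

print(with_match)   # ['16', '160', '1,2', '100,11']
print(with_search)  # ['16', '160', '1,2', '100,11', '16:00']
```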
Not sure you need regex for that:
string = ['16','160','1,2','100,11','1','16:','16:00']
keep = []
for elem in string:
    if ("," in elem and len(elem) == 3) or (":" not in elem and "," not in elem and len(elem) >= 2):
        keep.append(elem)
print(keep)
Output:
['16', '160', '1,2']
Although not as elegant, this tends to be faster than using a regex.
I am trying to capture a number from a string, which sometimes contains dot separators and sometimes it does not. In any case I need a number with the dot separator.
e.g.:
num = re.findall('\d{3}\.(?:\d{2}\.){4}\d{3}|\d{14}', txt)[0]
will capture both variations:
304.33.44.52.03.002
30433445203002
In case it captured the one without dots, I would need to add the dots following the pattern:
AAA.BB.CC.DD.EE.FFF
How can I add those dots with Python?
Solution without regexp.
You can convert the value to a string, turn it into a list of characters, and insert dots at the required positions.
n = 30433445203002
l = list(str(n))
Add dots in positions you need
l.insert(3, '.')
l.insert(6, '.')
l.insert(9, '.')
l.insert(12, '.')
l.insert(15, '.')
If this is a well-defined pattern, you can generalize the insertions above.
After insertion is done, join them back to the string:
num = "".join(l)
Input:
30433445203002
Output:
304.33.44.52.03.002
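One way the insertions could be generalized, as a rough sketch: slice the digit string by a tuple of group widths and join the pieces with dots. The function name add_dots and its widths parameter are assumptions made for this example, with the defaults taken from the AAA.BB.CC.DD.EE.FFF layout in the question.

```python
def add_dots(value, widths=(3, 2, 2, 2, 2, 3)):
    """Split the digits of value into groups of the given widths, joined by dots."""
    digits = str(value)
    if len(digits) != sum(widths):
        raise ValueError("expected %d digits, got %d" % (sum(widths), len(digits)))
    parts = []
    pos = 0
    for width in widths:
        parts.append(digits[pos:pos + width])  # take the next group
        pos += width
    return ".".join(parts)

print(add_dots(30433445203002))  # 304.33.44.52.03.002
```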
You can capture each "group" of numbers into a capturing group, and refer to it in the replacement string. The dots can be made optional with \.?.
import re
string = "30433445203002"
regex = r"(\d{3})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{3})"
pattern = "\\1.\\2.\\3.\\4.\\5.\\6"
result = re.sub(regex, pattern, string)
For more details, take a look at the re.sub documentation.
Output:
304.33.44.52.03.002
EDIT:
If I have misunderstood you and what you actually want is to get the first 3 digits, the 4th and 5th digits, the 6th and 7th digits, etc., you can use the same regex with search:
re.search(regex, string).group(1) # 304
re.search(regex, string).group(2) # 33
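To tie this back to the question, where the number is first located inside a longer text, here is a hedged sketch that finds the first number in either form and returns it with dots. The helper name dotted is hypothetical, and it assumes the number is not embedded in a longer run of digits.

```python
import re

NUM = re.compile(r"(\d{3})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{3})")

def dotted(txt):
    """Return the first number found in txt in dotted form, or None."""
    m = NUM.search(txt)
    if m is None:
        return None
    # Match.expand fills the captured groups into the template, as re.sub does.
    return m.expand(r"\1.\2.\3.\4.\5.\6")

print(dotted("ref 30433445203002 end"))       # 304.33.44.52.03.002
print(dotted("ref 304.33.44.52.03.002 end"))  # 304.33.44.52.03.002
```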
I'm using python and regex (new to both) to find sequence of chars in a string as follows:
Grab the first instance of p followed by any number (It'll always be in the form of p_ _ where _ and _ will be integers). Then either find an 's' or a 'go' then all integers till the end of the string. For example:
ascjksdcvyp12nbvnzxcmgonbmbh12hjg23
should yield p12 go 12 23.
ascjksdcvyp12nbvnzxcmsnbmbh12hjg23
should yield p12 s 12 23.
I've only managed to get the p12 part of the string and this is what I've tried so far to extract the 'go' or 's':
decoded = (re.findall(r'([p][0-9]*)',myStr))
print(decoded) # prints ['p12']
I know by doing something like
re.findall(r'[s]|[go]',myStr)
will give me all occurrences of s and g and o, but something like that is not what I'm looking for. And I'm not sure how I'd combine these regexes to get the desired output.
Use re.findall with pattern grouping:
>>> string = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 'go', '12', '23')]
>>> string = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 's', '12', '23')]
When the pattern contains capturing groups (), re.findall returns only what those groups matched
p\d{2} matches p followed by exactly two digits
After that .* matches anything
Then, s|go matches either s or go
\D* matches any number of non-digits
\d+ indicates one or more digits
(?:) is a non-capturing group i.e. the match inside won't show up in the output, it is only for the sake of grouping tokens
Note:
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+?', string)
[('p12', 's', '12')]
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', string)
[('p12', 's', '23')]
I would have liked to use one of the two patterns above, since matching the later digits is a repeated task, but both have problems: a repeated capturing group keeps only its last iteration, so the non-greedy version returns only the first number and the greedy version only the last. Hence we need to match the digits after s or go explicitly.
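If the number of trailing integers is not fixed, one possible workaround is to do the job in two steps: match the p## part and the s/go marker with one regex, then collect every digit run from the rest of the string. This is only a sketch. The function name decode is made up, and unlike the greedy .* above, the lazy .*? here stops at the first s or go after p##.

```python
import re

def decode(s):
    """Return [p##, 's' or 'go', int strings after the marker...] or None."""
    m = re.search(r'(p\d{2}).*?(s|go)', s)
    if m is None:
        return None
    # Every run of digits after the marker, however many there are.
    numbers = re.findall(r'\d+', s[m.end():])
    return [m.group(1), m.group(2)] + numbers

print(decode('ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'))  # ['p12', 'go', '12', '23']
print(decode('ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'))   # ['p12', 's', '12', '23']
```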
First, try to match your line with a minimal pattern, as a test. Use (capturing) and (?:non-capturing) parens to capture the interesting parts and not capture the uninteresting parts. Store away what you care about, then chop off the remainder of the string and search for numbers as a second step.
import re
line = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'  # sample input from the question
simple_test = r'.*?p(\d{2}).*?(?:s|go).*?(\d+)'  # lazy .*? so the first p## is used
m = re.match(simple_test, line)
if m is not None:
    p_num = m.group(1)
    trailing_numbers = [m.group(2)]
    remainder = line[m.end():]  # everything after the first trailing number
    trailing_numbers.extend( # extend list by appending
        map( # list from applying
            lambda m: m.group(1), # get group(1) from match
            re.finditer(r"(\d+)", remainder) # of each number in string
        )
    )
    print("P:", p_num, "Numbers:", trailing_numbers)
I have the following case, where in my string I have improperly formatted mentions of the form "(19561958)" that I would like to split into "(1956-1958)". The regular expression that I tried is:
import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)
but this returns me "(19561958-)". How can I achieve my purpose? Many thanks!
You could capture the two years separately, and insert the hyphen between the two groups:
>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'
Note that \d\d\d\d is written more concisely as \d{4}.
As currently written, this will insert a hyphen between the first two groups of four in any eight-digit-plus number. If you require the parentheses for the match, you can include them explicitly with look-arounds:
>>> re.sub(r'''
(?<=\() # make sure there's an opening parenthesis prior to the groups
(\d{4}) # one group of four digits
(\d{4}) # and a second group of four digits
(?=\)) # with a closing parenthesis after the two groups
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'
Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:
>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
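To make the difference between the three variants visible, here is a small sketch that runs them on a made-up string containing both the parenthesized years and a longer bare number. The sample text and variable names are assumptions for illustration only.

```python
import re

text = "(19561958) and 1234567890"

# No context check: any run of eight or more digits gets a hyphen.
plain = re.sub(r"(\d{4})(\d{4})", r"\1-\2", text)
# Word boundaries: only numbers of exactly eight digits are changed.
bounded = re.sub(r"\b(\d{4})(\d{4})\b", r"\1-\2", text)
# Lookarounds: the eight digits must sit directly inside parentheses.
parens = re.sub(r"(?<=\()(\d{4})(\d{4})(?=\))", r"\1-\2", text)

print(plain)    # (1956-1958) and 1234-567890
print(bounded)  # (1956-1958) and 1234567890
print(parens)   # (1956-1958) and 1234567890
```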
Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)" or r"(\d{4})(\d{4})".
The 2nd group is referenced with \2.
You could use capturing groups or lookarounds.
re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
\d{4} matches exactly 4 digits.
Example:
>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'
OR
Through lookarounds.
>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
(?<=\(\d{4}) Positive lookbehind which asserts that the match must be preceded by ( and four digit characters.
(?=\d{4}\)) Positive lookahead which asserts that the match must be followed by four digits plus the ) symbol.
Here a zero-width position (the boundary between the two years) is matched. Replacing that matched position with - will give you the desired output.
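A short sketch to confirm that the lookaround-only pattern matches an empty string at a single position rather than any characters (the variable names are just for this example):

```python
import re

a = "(19561958)"
m = re.search(r"(?<=\(\d{4})(?=\d{4}\))", a)

# The match is empty and sits between index 4 ('6') and index 5 ('1').
print(repr(m.group()), m.span())  # '' (5, 5)

# Inserting the hyphen at that position by hand gives the same result as re.sub.
by_hand = a[:m.start()] + "-" + a[m.end():]
print(by_hand)  # (1956-1958)
```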
I'm a Python beginner, so keep in mind my regex skills are level -122.
I need to convert a string with text containing file1 to file01, but not convert file10 to file010.
My program is wrong, but this is the closest I can get, I've tried dozens of combinations but I can't get close:
import re
txt = 'file8, file9, file10'
pat = r"[0-9]"
regexp = re.compile(pat)
print(regexp.sub(r"0\d", txt))
Can someone tell me what's wrong with my pattern and substitution and give me some suggestions?
You could capture the number and check the length before adding 0, but you might be able to use this instead:
import re
txt = 'file8, file9, file10'
pat = r"(?<!\d)(\d)(?=,|$)"
regexp = re.compile(pat)
print(regexp.sub(r"0\1", txt))
(?<! ... ) is called a negative lookbehind. It prevents a match if the pattern inside the lookbehind matches immediately before the current position. For example, (?<!a)b will match every b in a string except one that has an a before it, meaning the b in bb and cb matches, but the b in ab doesn't. (?<!\d)(\d) thus matches a digit, unless it has another digit before it.
(\d) is a single digit, enclosed in a capture group, denoted by simple parentheses. The matched digit gets stored in the first capture group.
(?= ... ) is a positive lookahead. It matches only if the pattern inside the lookahead matches immediately after the current position. In other words, a(?=b) will match an a in a string only if there's a b after it. The a in ab matches, but the a in ac or aa doesn't.
(?=,|$) is a positive lookahead containing ,|$ meaning either a comma, or the end of the string.
(?<!\d)(\d)(?=,|$) thus matches any digit, as long as there's no digit before it and there's a comma after it, or if that digit is at the end of the string.
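Note that the lookahead ties the pattern to this exact input format: a single digit followed by anything other than a comma or the end of the string is left alone. As a hedged sketch, a variant that only requires "no digit on either side" could look like the following. The name alt and the .txt sample string are assumptions, and this variant would also pad any other stand-alone digit in the text.

```python
import re

pat = r"(?<!\d)(\d)(?=,|$)"   # pattern from the answer above
alt = r"(?<!\d)(\d)(?!\d)"    # assumption: a digit with no digit before or after it

original = re.sub(pat, r"0\1", "file8, file9, file10")
untouched = re.sub(pat, r"0\1", "file8.txt file10.txt")
padded = re.sub(alt, r"0\1", "file8.txt file10.txt")

print(original)   # file08, file09, file10
print(untouched)  # file8.txt file10.txt
print(padded)     # file08.txt file10.txt
```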
How about:
a='file1'
a='file' + "%02d" % int(a.split('file')[1])
This approach uses a regex to find every sequence of digits and str.zfill to pad with zeros:
>>> txt = 'file8, file9, file10'
>>> re.sub(r'\d+', lambda m : m.group().zfill(2), txt)
'file08, file09, file10'