Finding overlapping sequence with regular expressions with Python - python

I'm trying to extract numbers and both previous and following characters (excluding digits and whitespaces) of a string. The expected return of the function is a list of tuples, with each tuple having the shape:
(previous_sequence, number, next_sequence)
For example:
string = '200gr T34S'
my_func(string)
>>[('', '200', 'gr'), ('T', '34', 'S')]
My first iteration was to use:
def my_func(string):
    res_obj = re.findall(r'([^\d\s]+)?(\d+)([^\d\s]+)?', string)
    return res_obj
But this function doesn't do what I expect. When I pass a string like '2AB3', I would like the output to be [('','2','AB'), ('AB','3','')], but instead I get [('','2','AB'), ('','3','')], because 'AB' was already consumed as part of the previous match.
How could I fix this?

Since the numbers themselves never overlap, a single trailing lookahead
assertion should be all you need.
Something like ([^\d\s]+)?(\d+)(?=([^\d\s]+)?)
Or this, ([^\d\s]*)(\d+)(?=([^\d\s]*)), if you care about
the difference between NULL (an unmatched group) and the empty string.
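A minimal sketch of how the trailing lookahead could be dropped into the asker's my_func (the function name comes from the question; the constant name PATTERN is only illustrative). Because the text after each number is only inspected and not consumed, it is still available as the prefix of the next match.

```python
import re

# Digits are consumed; the text after them is only peeked at via (?=...),
# so it can be matched again as the prefix of the next number.
PATTERN = re.compile(r'([^\d\s]*)(\d+)(?=([^\d\s]*))')

def my_func(string):
    """Return (previous_sequence, number, next_sequence) tuples."""
    return PATTERN.findall(string)

print(my_func('200gr T34S'))  # [('', '200', 'gr'), ('T', '34', 'S')]
print(my_func('2AB3'))        # [('', '2', 'AB'), ('AB', '3', '')]
```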

Instead of combining the + and ? quantifiers, you can simply use *:
>>> re.findall(r'([^\d\s]*)(\d+)([^\d\s]*)',string)
[('', '200', 'gr'), ('T', '34', 'S')]
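But if you mean to match the overlapping strings, you can wrap the pattern in a positive lookahead to find all the overlapping matches:
>>> re.findall(r'(?=([^\d\s]*)(\d+)([^\d\s]*))','2AB3')
[('', '2', 'AB'), ('AB', '3', ''), ('B', '3', ''), ('', '3', '')]
Note that a match is attempted at every position, so the result also contains the shorter prefixes ('B', '3', '') and ('', '3', '').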

Another way is to combine a regex split with a small function:
import re
# '200gr T34S' '2AB3'
def s(x):
    tmp = []
    # Splitting on whitespace or a captured digit run keeps the digits in the result
    d = re.split(r'\s+|(\d+)', x)
    d = ['' if v is None else v for v in d]  # whitespace splits leave None for the group
    digits = '0123456789'
    # Walk by position instead of d.index(i), which returns the first
    # occurrence and so breaks when the same number appears twice
    for pos, item in enumerate(d):
        if item and item[0] in digits:
            prev = d[pos - 1] if pos > 0 else ''
            nxt = d[pos + 1] if pos + 1 < len(d) else ''
            tmp.append((prev, item, nxt))
    return tmp
print(s('2AB3'))
Prints:
[('', '2', 'AB'), ('AB', '3', '')]

Related

python regular expression optional but mandatory if character precedes

I am trying to capture something along the lines of
1/2x1 + 3x2 - 4/5x3
I will strip the spaces beforehand, so the regular expression does not need to capture them. The catch is that the preceding coefficient may optionally be a fraction: if I see a /, it must be followed by \d+. I don't necessarily care to capture the /.
Ideally I would extract the groups as such:
# first match
match.groups(1)
('1', '2', 'x1')
#second match
('+', '3', 'x2')
#third match
('-', '4', '5', 'x3')
Something that is (sort of) working is ([+-])?(\d)+(\/\d)?([a-zA-Z]+\d+). However I don't love that it also captures the preceding '/'
Example output:
>>> regexp = re.compile('([+-])?(\d)+(\/\d)?([a-zA-Z]+\d+)')
>>> expr = '1/2a3+1/8x2-4x3'
>>> match = regexp.search(expr)
>>> match.groups(1)
(1, '1', '/2', 'a3')
>>> expr = expr.replace(match.group(0), '')
>>> match = regexp.search(expr)
>>> match.groups(1)
('+', '1', '/8', 'x2')
>>> expr = expr.replace(match.group(0), '')
>>> match = regexp.search(expr)
>>> match.groups(1)
('-', '4', 1, 'x3')
In the first match, what does the first element 1 mean? I see the same thing in the third match, third element. In both of these - that particular "group" is missing. So is that just a way of being like "I matched, but I didn't match anything"?
Another issue with the above regex, is it makes the [+-] optional. I want it to be optional on the first term, but it is mandatory on subsequent terms.
Anyways the above is usable, I'll need to peel off the /, and I can sanitize the input to ensure the +- are always there, but it's not as elegant as I'm sure it can be.
Thanks for any help
You could rework your regex slightly to use capturing groups only for things you want to capture and then use re.findall to extract all matches at once:
expr = '1/2a3+1/8x2-4x3'
regexp = re.compile(r'([+-])?(\d+)(?:/(\d+))?([a-zA-Z]+\d+)')
res = regexp.findall(expr)
Output:
[
('', '1', '2', 'a3'),
('+', '1', '8', 'x2'),
('-', '4', '', 'x3')
]
Note that when there is no fraction (or no sign on the first term), the tuple contains empty strings (''). You could filter those out if required, e.g.
[tuple(filter(lambda x:x, tup)) for tup in res]
# [('1', '2', 'a3'), ('+', '1', '8', 'x2'), ('-', '4', 'x3')]
However, you would then have trouble knowing which value in each tuple corresponds to which part of the expression.
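On the question of what the 1 in the output means: Match.groups() takes an optional default that is substituted for groups that did not take part in the match, so groups(1) fills the gaps with the integer 1 rather than None. The sketch below shows that, and also one possible way around the position ambiguity by naming the groups. The names sign, num, den and var are illustrative assumptions, not part of the original answer.

```python
import re

# Same shape as the pattern in the answer, with named groups.
# The group names (sign, num, den, var) are illustrative assumptions.
TERM = re.compile(
    r'(?P<sign>[+-])?(?P<num>\d+)(?:/(?P<den>\d+))?(?P<var>[a-zA-Z]+\d+)'
)

expr = '1/2a3+1/8x2-4x3'
first = TERM.search(expr)

print(first.groups())   # (None, '1', '2', 'a3')  unmatched groups are None
print(first.groups(1))  # (1, '1', '2', 'a3')     the argument replaces None

# groupdict() keeps every part addressable by name, matched or not.
terms = [m.groupdict() for m in TERM.finditer(expr)]
print(terms[2])  # {'sign': '-', 'num': '4', 'den': None, 'var': 'x3'}
```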

find hex value in list using regular expression

Input:
We are given a list:
lst = ['a', '4', 'add', 'e', 'a', 'c0a8d202', '128', '4', '0', '32']
Using a regular expression, find the hex value in the given list.
Output:
The index of the hex value (in our case the hex value is c0a8d202, so return the index 5).
Try this:
import re
lst = ['a', '4', 'add', 'e', 'a', 'c0a8d202', '128', '4', '0', '32']
pattern = re.compile(r'c[0-9a-fA-F]?')
for i in lst:
    if re.search(pattern, i):
        print(lst.index(i))
Note:
This matches your desired output, but I agree with @Jean-François Fabre, who asked: what's wrong with lst.index('c0a8d202')? What's the point of a regular expression here when you already have the value?
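The pattern above only succeeds because c0a8d202 happens to be the only item containing a c. The question does not define "hex value", and items such as 'a', 'add' and '128' are also valid hexadecimal. As a sketch only, assuming a hex value means a string of exactly eight hexadecimal digits, a stricter check could look like this (find_hex_indices is a hypothetical helper name):

```python
import re

# Assumption: a "hex value" is exactly eight hexadecimal digits.
HEX8 = re.compile(r'[0-9a-fA-F]{8}')

def find_hex_indices(items):
    """Return the index of every item that consists of eight hex digits."""
    # enumerate gives the real position; list.index would return the
    # first occurrence if the same value appeared more than once.
    return [i for i, item in enumerate(items) if HEX8.fullmatch(item)]

lst = ['a', '4', 'add', 'e', 'a', 'c0a8d202', '128', '4', '0', '32']
print(find_hex_indices(lst))  # [5]
```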

Regex for split or findall each digit python

What is the best way to split this string into a list of runs of identical consecutive digits?
My solution:
>>> str
> '2223334441214844'
>>> filter(None, re.split("(0+)|(1+)|(2+)|(3+)|(4+)|(5+)|(6+)|(7+)|(8+)|(9+)", str))
> ['222', '333', '444', '1', '2', '1', '4', '8', '44']
The more flexible way would be to use itertools.groupby which is made to match consecutive groups in iterables:
>>> s = '2223334441214844'
>>> import itertools
>>> [''.join(group) for key, group in itertools.groupby(s)]
['222', '333', '444', '1', '2', '1', '4', '8', '44']
The key would be the single key that is being grouped on (in your case, the digit). And the group is an iterable of all the items in the group. Since the source iterable is a string, each item is a character, so in order to get back the fully combined group, we need to join the characters back together.
You could also repeat the key for the length of the group to get this output:
>>> [key * len(list(group)) for key, group in itertools.groupby(s)]
['222', '333', '444', '1', '2', '1', '4', '8', '44']
If you wanted to use regular expressions, you could make use of backreferences to find consecutive characters without having to specify them explicitly:
>>> re.findall('((.)\\2*)', s)
[('222', '2'), ('333', '3'), ('444', '4'), ('1', '1'), ('2', '2'), ('1', '1'), ('4', '4'), ('8', '8'), ('44', '4')]
For finding consecutive characters in a string, this is essentially the same that groupby will do. You can then filter out the combined match to get the desired result:
>>> [x for x, *_ in re.findall('((.)\\2*)', s)]
['222', '333', '444', '1', '2', '1', '4', '8', '44']
One solution without regex (that is not specific to digits) would be to use itertools.groupby():
>>> from itertools import groupby
>>> s = '2223334441214844'
>>> [''.join(g) for _, g in groupby(s)]
['222', '333', '444', '1', '2', '1', '4', '8', '44']
If you only need to extract consecutive identical digits, you may use a matching approach using r'(\d)\1*' regex:
import re
s='2223334441214844'
print([x.group() for x in re.finditer(r'(\d)\1*', s)])
# => ['222', '333', '444', '1', '2', '1', '4', '8', '44']
Here,
(\d) - matches and captures into Group 1 any digit
\1* - a backreference to Group 1 matching the same value, 0+ repetitions.
This solution can be customized to match any specific consecutive chars (instead of \d, you may use \S - non-whitespace, \w - word, [a-fA-F] - a specific set, etc.). If you replace \d with . and use re.DOTALL modifier, it will work as the itertools solutions posted above.
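To illustrate the last point, here is a small sketch comparing the . plus re.DOTALL variant against itertools.groupby on a string that contains newlines. The function names and the sample string are illustrative only.

```python
import re
from itertools import groupby

def runs_regex(text):
    # re.DOTALL lets '.' match newlines too, so no character is skipped
    return [m.group() for m in re.finditer(r'(.)\1*', text, flags=re.DOTALL)]

def runs_groupby(text):
    return [''.join(group) for _, group in groupby(text)]

sample = 'aa\n\nb22'
print(runs_regex(sample))    # ['aa', '\n\n', 'b', '22']
print(runs_groupby(sample))  # ['aa', '\n\n', 'b', '22']
```

Without re.DOTALL the regex version would silently drop the newline run, because . does not match a newline by default.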
Use a capture group and backreference.
str = '2223334441214844'
import re
print([i[0] for i in re.findall(r'((\d)\2*)', str)])
\2 matches whatever the (\d) capture group matched. The list comprehension is needed because when the RE contains capture groups, findall returns a list of the capture groups, not the whole match. So we need an extra group to get the whole match, and then need to extract that group from the result.
What about doing it without importing any module?
You can write your own logic in pure Python. Here is a recursive approach:
string_1 = '2223334441214844'
list_2 = [i for i in string_1]
def con(list_1):
    group = []
    if not list_1:
        return 0
    else:
        track = list_1[0]
        for j, i in enumerate(list_1):
            if i == track[0]:
                group.append(i)
            else:
                # Run ended: print it and recurse on the rest of the list
                print(group)
                return con(list_1[j:])
        # Only the last run reaches this point and is returned
        return group
print(con(list_2))
output:
['2', '2', '2']
['3', '3', '3']
['4', '4', '4']
['1']
['2']
['1']
['4']
['8']
['4', '4']
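The recursive version prints each group as it goes and returns only the last one. If a single list of joined groups is wanted instead, a plain loop can collect them. This is a sketch, and split_runs is a hypothetical name:

```python
def split_runs(text):
    """Split text into runs of identical consecutive characters."""
    runs = []
    for ch in text:
        if runs and runs[-1][-1] == ch:
            runs[-1] += ch      # same character: extend the current run
        else:
            runs.append(ch)     # different character: start a new run
    return runs

print(split_runs('2223334441214844'))
# ['222', '333', '444', '1', '2', '1', '4', '8', '44']
```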

using python 3.6 to slice substring with same char [duplicate]

I am not well experienced with Regex but I have been reading a lot about it. Assume there's a string s = '111234' I want a list with the string split into L = ['111', '2', '3', '4']. My approach was to make a group checking if it's a digit or not and then check for a repetition of the group. Something like this
L = re.findall('\d[\1+]', s)
I think that \d[\1+] will basically check for either "digit" or "digit +" the same repetitions. I think this might do what I want.
Use re.finditer():
>>> s='111234'
>>> [m.group(0) for m in re.finditer(r"(\d)\1*", s)]
['111', '2', '3', '4']
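A short sketch of why the attempt in the question finds nothing, and why finditer is used here instead of findall. Inside square brackets \1 is not a backreference (Python's re reads it as an octal character escape), and the original pattern has no capturing group for it to refer to in any case.

```python
import re

s = '111234'

# Inside [...] the \1 is read as an octal escape (the character chr(1)),
# so this looks for a digit followed by chr(1) or a literal '+'.
print(re.findall(r'\d[\1+]', s))  # []

# With a real group and backreference, findall returns only the group...
print(re.findall(r'(\d)\1*', s))  # ['1', '2', '3', '4']

# ...while finditer gives access to the whole match via group(0).
print([m.group(0) for m in re.finditer(r'(\d)\1*', s)])  # ['111', '2', '3', '4']
```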
If you want to group all the repeated characters, then you can also use itertools.groupby, like this
from itertools import groupby
print(["".join(grp) for num, grp in groupby('111234')])
# ['111', '2', '3', '4']
If you want to keep only the groups of digits, then
print(["".join(grp) for num, grp in groupby('111aaa234') if num.isdigit()])
# ['111', '2', '3', '4']
Try this one:
import re
s = '111234'
l = re.findall(r'((.)\2*)', s)
## at this stage I have [('111', '1'), ('2', '2'), ('3', '3'), ('4', '4')] in l
## now I keep only the first value of each tuple in the list
lst = [x[0] for x in l]
print(lst)
output:
['111', '2', '3', '4']
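If you don't want to use any libraries, here's the code:
s = "AACBCAAB"
L = []
temp = s[0]
for i in range(1, len(s)):
    if s[i] == s[i-1]:
        temp += s[i]    # same character: extend the current run
    else:
        L.append(temp)  # run ended: store it and start a new one
        temp = s[i]
L.append(temp)          # store the final run (also covers a one-character string)
print(L)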
Output:
['AA', 'C', 'B', 'C', 'AA', 'B']

Capturing multiple optional groups in a regex both repeating and non repeating

I have to match an expression similar to these
STAR 13
STAR 13, 23
STAR 1, 2 and 3 and STAR 1
But only capture the digits.
The number of digits is unspecified.
I've tried with STAR(?:\s*(?:,|and)\s*(#\d+))+
But it doesn't seem to capture the terms exactly.
No other dependencies could be added. Just the re module only.
The problem is a much larger one where STAR is another regular expression which has already been solved. Please don't bother about it and just consider it as a letter combination. Just include the letters STAR in regular expressions.
If you don't know how many digits a number has, use r'[0-9]+' to match one or more digits. To capture every number, you can use r'(\d+)'.
Do it with one regex (here a is each of the three input strings in turn):
re.findall("STAR ([0-9]+),? ?([0-9]+)? ?a?n?d? ?([0-9]+)?",a)
[('13', '', '')]
[('13', '23', '')]
[('1', '2', '3'), ('1', '', '')]
It may be easier, and give a cleaner result, to do it in two steps. First put the strings in a list like this:
tab = ["STAR 13","STAR 13, 23","STAR 1, 2 and 3 and STAR 1"]
list_star = filter(lambda x: re.match("^STAR",x),tab)
for i in list_star:
    print(re.findall(r'\d+', i))
['13']
['13', '23']
['1', '2', '3', '1']
After that, you just need to collect the results in a new list: my_digit += re.findall(r'\d+', i)
In 1 line:
import functools
import re
tab = ["STAR 13","STAR 13, 23","STAR 1, 2 and 3 and STAR 1"]
digit = functools.reduce(lambda x, y: x + re.findall(r"\d+", y), filter(lambda x: re.match("^STAR ", x), tab), [])
['13', '13', '23', '1', '2', '3', '1']
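The attempt in the question falls short for a few reasons: it requires a literal # before each number, it demands a comma or "and" before the first number, and a capture group inside a repetition keeps only its last repetition. A possible sketch using only the re module is to first match each whole STAR phrase and then pull the digits out of that phrase. Here STAR is a stand-in for the asker's real sub-expression, and the names STAR_LIST and star_numbers are hypothetical.

```python
import re

# 'STAR' stands in for the asker's real sub-expression.
# One STAR phrase: STAR, a number, then any amount of ", n" or "and n".
STAR_LIST = re.compile(r'STAR\s+\d+(?:\s*(?:,|and)\s*\d+)*')

def star_numbers(text):
    """Return one list of digit strings per STAR phrase found in text."""
    return [re.findall(r'\d+', m.group()) for m in STAR_LIST.finditer(text)]

print(star_numbers('STAR 13'))                     # [['13']]
print(star_numbers('STAR 13, 23'))                 # [['13', '23']]
print(star_numbers('STAR 1, 2 and 3 and STAR 1'))  # [['1', '2', '3'], ['1']]
```

If the real STAR expression can itself contain digits, the second step would pick those up too, so the inner findall would need to run on the part of the match after the prefix.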
