With regex only, how can I match an exact number of consecutive repetitions of an arbitrary single token? For example, matching the "aaa" in “ttaaabbb” but not the "aaaa" in “ttaaaabbb”, given that the desired number of repetitions is 3.
Clarification: I was only using "a" as an example; the token can be an arbitrary character, digit, or symbol. That is, given that the desired number of repetitions is 3, the desired matches in "aaaa**!!!cccc333**" are only "!!!" and "333".
In short, I want to find a list of tokens "X" where YXXXY appears in the given string (Y is any token different from X; Y can also be the start or the end of the string). Note that the list can contain repeated tokens, e.g., "aaabbbbaaa" should give ["a", "a"].
Some other examples:
Input: "aaabbbbbbaaa****ccc", output: ["a", "a", "c"]
Input: "!!! aaaabbbaaa ccc!!!", output: ["!", "b", "a", "c", "!"].
What I have tried: I tried (.)\1{2} but unfortunately, it matches "aaaa" and "ccccc" as well in the example above. I further changed it to (?!\1)(.)\1{2}(?!\1) such that the prefix and postfix of the repeating pattern differ from it. However, I got an error in this case since the first \1 is undefined when being referred to.
You might use a pattern with 2 capture groups and a repeated backreference.
First match 4 or more times the same repeated character that you want to avoid, then match 3 times the same character.
The single characters that you want are in capture group 2, which you can get using re.finditer for example.
(\S)\1{3,}|(\S)\2{2}
The pattern matches:
(\S)\1{3,} Capture group 1, match a non whitespace char and repeat the backreference 3 or more times
| Or
(\S)\2{2} Capture group 2, match a non whitespace char and repeat the backreference 2 times
For example:
import re
strings = [
    "aaaa**!!!cccc333**",
    "aaabbbbaaa",
    "aaabbbbbbaaa****ccc",
    "!!! aaaabbbaaa ccc!!!"
]
pattern = r"(\S)\1{3,}|(\S)\2{2}"
for s in strings:
    result = []
    for match in re.finditer(pattern, s):
        # Group 2 is set only when the run is exactly three characters long
        if match.group(2):
            result.append(match.group(2))
    print(result)
Output
['!', '3']
['a', 'a']
['a', 'a', 'c']
['!', 'b', 'a', 'c', '!']
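The same two-alternative idea can be extended to other run lengths. A minimal sketch, assuming a hypothetical helper `exact_run_tokens` (not part of the answer above) that builds the pattern from the desired count `n`:

```python
import re

def exact_run_tokens(text, n):
    """Return each non-whitespace character that occurs in a run of exactly n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # First alternative swallows runs of n + 1 or more characters (group 1);
    # second alternative captures runs of exactly n characters (group 2).
    pattern = re.compile(r"(\S)\1{%d,}|(\S)\2{%d}" % (n, n - 1))
    return [m.group(2) for m in pattern.finditer(text) if m.group(2)]

print(exact_run_tokens("aaabbbbbbaaa****ccc", 3))  # ['a', 'a', 'c']
print(exact_run_tokens("aabbbcc", 2))              # ['a', 'c']
```

The order of the alternatives matters: the longer runs have to be consumed first, otherwise the second alternative would match the first `n` characters of a longer run.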
You can do something like this using a regex and a loop:
import re
def exact_re_match(string, length):
    # (.)\1* matches each maximal run of one repeated character
    regex = re.compile(r'(.)\1*')
    for match in regex.finditer(string):
        elm = match.group()
        if len(elm) == length:
            yield elm
string = "aaaa!!!cccc333"
out = list(exact_re_match(string, 3))
print(out)
# ['!!!', '333']
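If regex is not a hard requirement, the same run-by-run filtering can be sketched with `itertools.groupby`. This is an alternative technique, not the answer's method; `exact_runs` is a hypothetical name:

```python
from itertools import groupby

def exact_runs(text, length):
    """Yield every maximal run of one repeated character that is exactly `length` long."""
    for char, group in groupby(text):
        run_length = sum(1 for _ in group)  # count the items in this run
        if run_length == length:
            yield char * length

print(list(exact_runs("aaaa!!!cccc333", 3)))  # ['!!!', '333']
```

Unlike the `(.)\1*` pattern, this also treats newline characters as tokens, because `groupby` makes no distinction between characters.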
I have a badly formatted log and need to extract it into a dictionary using Python.
# Pattern (keys are not literally kw1, kw2, etc.; there is no pattern in the keys):
"para1=a, kw2=b, (b, b=b), bb, kw3=c, t4=..."
# where
# - para1=a
# - kw2=b, (b, b=b), bb
# - kw3=c
# - and so on
# extract into a dict:
out = {"para1": "a", "kw2": "b, (b, b=b), bb", "kw3": "c", "t4": ...}
# Note several important features
'''
1. all 'kw=value' pairs are joined with a fixed splitter: ', '
2. value and kw themselves may contain the splitter, i.e. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
3. all brackets are paired (therefore we can identify splitters inside kw or value).
4. every part of the message belongs to a kw or a value.
'''
Q1: Is there a regex (or some Python code) that helps me get the keys and values above?
There is no pattern in the keys; kwN is just a placeholder for a key.
Q2 (update: thanks to Laurent, I already know why this doesn't work): I got an unexpected result. ', (.*?)=' should give me the shortest match between ',' and '=', right?
msg = 'a, a, b=b, c=c'
re.findall(', (.*?)=', msg)
>>> ['a, b', 'c']
# I was expecting ['b','c']
# shouldn't ', (.*?)=' give me the shortest matching between ',' and '='? which is 'b' instead of 'a, b'
(New) Q3: Since I'm working with huge amounts of data, efficiency is my first priority. I've worked out Python code that achieves the goal, but it doesn't feel quick enough. Could you help me make it better?
import re
def my_not_efficient_solution(msg):
    '''
    Notes:
    1. all 'kw=value' pairs are joined with a fixed splitter: ', '
    2. value and kw themselves may contain the splitter, i.e. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
    3. all brackets are paired (therefore we can identify splitters inside kw or value).
    4. every part of the message belongs to a kw or a value.
    Solution:
    1. split message with splitter -> get entries
    2. check each entry for brackets and equal sign
    3. for each entry: append to last one or serve as part of next one or good with itself
    '''
    spliter=', '
    eq_sign=['=']
    first=False
    bracket_map={'(':1,")":-1,"[":1,"]":-1}
    pair_chk_func = lambda s: not sum([bracket_map.get(i,0) for i in s])
    eq_chk_func = lambda s: sum([i in s for i in eq_sign])
    assert pair_chk_func(msg), 'msg pair check fail.'
    res = msg.split(spliter)
    # step1: split entry
    entries=[]
    do_pre='' # last entry is not complete(lack bracket)
    do_first = '' # last entry is not complete(lack eq sign)
    while len(res)>0:
        if first and len(entries)==2:
            entries.pop(-1)
            break
        if do_first and len(entries)==0:
            do=do_first+res.pop(0)
        else:
            do_first=''
            do=res.pop(0)
        eq_chk=eq_chk_func(do_pre+do)
        pair_chk=pair_chk_func(do_pre+do)
        # case1: not valid entry, no eq sign
        # case2: previous entry not complete
        # case3: current entry not valid(no eq sign, will drop) and pair incomplete(may be part of next entry)
        if not eq_chk or do_pre:
            if len(entries) > 0:
                entries[-1]+=spliter+do
                pair_chk=pair_chk_func(entries[-1])
                if pair_chk: do_pre=''
                else: do_pre=entries[-1]
            elif not pair_chk:
                do_first=do
        # case4: current entry good to go
        elif eq_chk and pair_chk:
            entries.append(do)
            do_pre=''
        # case5: current entry not complete(pair not complete)
        else:
            entries.append(do)
            do_pre=do
    # step2: split each into dict
    output={}
    split_mark = '|'.join(eq_sign)
    for entry in entries:
        splits=re.split(split_mark, entry)
        if len(splits)<2:
            raise ValueError('split fail for message')
        kw = splits.pop(0)
        # keep re-joining until the brackets in the key are balanced
        while not pair_chk_func(kw):
            kw += '='+splits.pop(0)
        output[kw]='='.join(splits)
    return output
msg = 'B_=a, kw2=b, f(A=3, k=2)=g(t=3, v=5), mark[(blabla), f(xx tt)=33]'
my_not_efficient_solution(msg)
>>> {'B_': 'a',
'kw2': 'b',
'f(A=3, k=2)': 'g(t=3, v=5), mark[(blabla), f(xx tt)=33]'}
Answer to Q1:
Here is my suggestion:
import re
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
pattern = r'(?=(kw.)=(.*?)(?:, kw.=|$))'
result = dict(re.findall(pattern, s))
print(result) # {'kw1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c', 'kw4': '...'}
To explain the regex:
the (?=...) is a lookahead assertion to let you find overlapping matches
the ? in (.*?) makes the quantifier * (asterisk) non-greedy
the ?: makes the group (?:, kw.=|$) non-capturing
the |$ at the end allows to take account of the last value in your string
Answer to Q2:
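No, that expectation is wrong. The quantifier *? is non-greedy, but the regex engine still starts at the leftmost position where a match is possible (the first ', ') and only then extends as little as needed to reach an '='. Moreover, there is no search for overlapping matches, which could be done with (?=...). So the result you observed is the expected one.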
I may suggest you this simple solution:
msg = 'a, a, b=b, c=c'
result = re.findall(', ([^,]*?)=', msg)
print(result) # ['b', 'c']
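To make the "leftmost first, then shortest" behaviour visible, here is a small sketch that prints where each pattern starts matching. It uses only `re.search` and the `msg` string from the question; the variable names `lazy` and `bounded` are illustrative:

```python
import re

msg = 'a, a, b=b, c=c'

# Lazy quantifier: the match still begins at the first ', ' (index 1)
lazy = re.search(r', (.*?)=', msg)
print(lazy.span(), repr(lazy.group(1)))        # (1, 8) 'a, b'

# Excluding commas forces the engine to give up at index 1 and retry at index 4
bounded = re.search(r', ([^,]*?)=', msg)
print(bounded.span(), repr(bounded.group(1)))  # (4, 8) 'b'
```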
Q1: Is there a regex expression that helps me get above key and value?
To get the key: value pairs in dictionary format, say your string is
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"
Using the following regex gives you key and values in a list
results = re.findall(r'(\bkw\d+)=(.*?)(?=,+\s*\bkw\d+=|$)', s)
[('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', 'dd'), ('kw10', 'jndn')]
You can convert it to a dictionary as
dict(results)
Output :
{
'kw1': 'a',
'kw2': 'b, (b, b=b), bb',
'kw3': 'c',
'kw4': 'dd',
'kw10': 'jndn'
}
Explanation :
\b is a word boundary, so the pattern matches kw but not something like XYZkw
\bkw\d+= Match the word kw followed by 1+ digits and =
.*? (Lazy match) Match as few chars as possible
(?= Positive lookahead, assert to the right
,+\s*\bkw\d+= Match 1+ commas, optional whitespace chars, then kw, 1+ digits and =
| Or
$ Assert the end of the string for the last part
) Close the lookahead
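Both regexes above rely on keys that look like kw followed by a character or digits, which the question says is not guaranteed. One possible alternative for Q1 and Q3 is a single pass over the string that tracks bracket depth. The following is only a sketch under the four rules listed in the question: `parse_pairs` and its `sep` parameter are hypothetical names, brackets are assumed to be limited to `()` and `[]`, and no timing claims are made for it.

```python
def parse_pairs(msg, sep=", "):
    """Split 'key=value' pairs on top-level separators, ignoring bracketed text."""
    opening, closing = "([", ")]"
    segments = []   # (text, offset of first top-level '=' or -1)
    depth = 0       # current bracket nesting depth
    start = 0       # start index of the current segment
    eq_pos = -1     # offset of the first top-level '=' in the current segment
    i = 0
    while i < len(msg):
        ch = msg[i]
        if ch in opening:
            depth += 1
        elif ch in closing:
            depth -= 1
        elif depth == 0:
            if ch == "=" and eq_pos < 0:
                eq_pos = i - start
            elif msg.startswith(sep, i):
                segments.append((msg[start:i], eq_pos))
                i += len(sep)
                start, eq_pos = i, -1
                continue
        i += 1
    segments.append((msg[start:], eq_pos))

    result = {}
    last_key = None
    for text, eq in segments:
        if eq >= 0:
            # A top-level '=' starts a new key/value pair
            last_key = text[:eq]
            result[last_key] = text[eq + 1:]
        elif last_key is not None:
            # No top-level '=': the segment belongs to the previous value
            result[last_key] += sep + text
        else:
            raise ValueError("message does not start with a key=value pair")
    return result

print(parse_pairs("para1=a, kw2=b, (b, b=b), bb, kw3=c"))
# {'para1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c'}
```

Each character is visited once, so the cost grows linearly with the message length, and no intermediate lists are popped from the front.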
I want to write a regex expression for words with even-numbered length.
For example, the output I want from the list containing the words:
{"blue", "ah", "sky", "wow", "neat"} is {"blue", "ah", "neat"}.
I know that the expression \w{2} or \w{4} would match 2-letter or 4-letter words, but what I want is something that works for all even lengths. I tried using \w{%2==0} but it doesn't work.
You can repeat 2 word characters as a group between anchors ^ to assert the start and $ to assert the end of the string, or between word boundaries \b
^(?:\w{2})+$
import re
strings = [
    "blue",
    "ah",
    "sky",
    "wow",
    "neat"
]
for s in strings:
    # re.match already anchors at the start, so only the trailing $ is needed
    m = re.match(r"(?:\w{2})+$", s)
    if m:
        print(m.group())
Output
blue
ah
neat
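The word-boundary variant mentioned above is useful when the words sit inside a longer string rather than in separate strings. A minimal sketch, assuming a made-up sample sentence `text`:

```python
import re

text = "blue ah sky wow neat"

# \b on both sides forces the pairs of word characters to cover a whole word
even_words = re.findall(r"\b(?:\w{2})+\b", text)
print(even_words)  # ['blue', 'ah', 'neat']
```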
If you need no extra validation for the strings in your set, you can simply use
words = {"blue", "ah", "sky", "wow", "neat"}
print( list(w for w in words if len(w) % 2 == 0) )
# => ['ah', 'blue', 'neat'] (order may vary, since sets are unordered)
If you want to make sure the words you return are made of letters, you can use
import re
words = {"blue", "ah", "sky", "wow", "neat"}
rx = re.compile(r'(?:[^\W\d_]{2})+') # For any Unicode letter words
# rx = re.compile(r'(?:[a-zA-Z]{2})+') # For ASCII only letter words
print( [w for w in words if rx.fullmatch(w)] )
# => ['blue', 'ah', 'neat'] (order may vary, since sets are unordered)
A (?:[^\W\d_]{2})+ pattern matches one or more occurrences of any two Unicode letters. Together with fullmatch, it requires a string to consist of an even number of letters.
I have 2 scenarios for splitting a string:
"##$hello?? getting good.<li>hii"
I want it to be split as 'hello','getting','good.<li>hii' (Scenario 1)
'hello','getting','good','li','hi' (Scenario 2)
Any ideas, please?
Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
    new RegExp("\\w+", "gi")
);
console.log(matches);
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z]+ to match only letters and not digits, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
    new RegExp("[A-Za-z]+", "gi")
);
console.log(matches);
It matches whatever is in the brackets one or more times, as many as possible. In this case that is the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
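For readers following along in Python, the two JavaScript snippets above translate roughly as follows. This is a sketch using `re.findall`; the variable names are illustrative:

```python
import re

value = "##$hello?? getting good.<li>hii"
# \w+ grabs each run of word characters (letters, digits, underscore)
words = re.findall(r"\w+", value)
print(words)    # ['hello', 'getting', 'good', 'li', 'hii']

value_with_digits = "##$hello?? getting good.<li>hii777bloop"
# [A-Za-z]+ grabs letters only, so the digits act as a separator
letters = re.findall(r"[A-Za-z]+", value_with_digits)
print(letters)  # ['hello', 'getting', 'good', 'li', 'hii', 'bloop']
```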
For the first scenario, use a regex to find all words made of word characters and the characters <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one, you need to collapse the repeated characters only if the word is not a valid English word. You can do this using the nltk words corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
In case you are looking for a solution without regex: string.punctuation gives you a string of all punctuation characters.
Use it with a list comprehension to achieve your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: Below is the step by step instruction regarding how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
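The same replace-then-split idea can also be sketched with `str.translate`, which performs the per-character substitution through a translation table instead of a Python-level loop. The name `table` is illustrative:

```python
import string

# Map every punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

my_string = '##$hello?? getting good.<li>hii'
my_desired_list = my_string.translate(table).split()
print(my_desired_list)  # ['hello', 'getting', 'good', 'li', 'hii']
```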
There are multiple space-separated characters in the input, e.g. string = "a b c d a s e "
What should the pattern be such that when I do re.search on the input using the pattern, I'd get the j'th character along with the space following it in the input by using .group(j)?
I tried something of the sort "^(([a-zA-Z])\s)+" but this is not working. What should I do?
EDIT
My actual question is in the heading and the body described only a special case of it:
Here's the general version of the question: if I have to take in all patterns of a specific type (initial question had the pattern "[a-zA-Z]\s") from a string, what should I do?
Use findall() instead and get the j-th match by index:
>>> j = 2
>>> re.findall(r"[a-zA-Z]\s", string)[j]
'c '
where [a-zA-Z]\s would match a lower or upper case letter followed by a single space character.
Why use regex when you can simply use str.split() method and access to the characters with a simple indexing?
>>> new = s.split()
>>> new
['a', 'b', 'c', 'd', 'a', 's', 'e']
You could do:
>>> string = "a b c d a s e "
>>> j=2
>>> re.search(r'([a-zA-Z]\s){%i}' % j, string).group(1)
'b '
Explanation:
With the pattern ([a-zA-Z]\s) you capture a letter followed by a whitespace character;
With the repetition {2} added, the group keeps only the last repetition, in this case the second one (the repetition count here is 1-based, whereas list indexing with findall is 0-based).
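This also explains why the attempt in the question, a repeated group followed by `.group(j)`, cannot return every occurrence: a capture group that repeats keeps only its last iteration. A short sketch of that behaviour, with `re.finditer` as one way to collect all occurrences (variable names are illustrative):

```python
import re

string = "a b c d a s e "

# The group is overwritten on every repetition, so only the last one survives
m = re.search(r"([a-zA-Z]\s)+", string)
print(repr(m.group(1)))  # 'e '

# Iterating over the matches keeps every occurrence
pairs = [match.group() for match in re.finditer(r"[a-zA-Z]\s", string)]
print(pairs)             # ['a ', 'b ', 'c ', 'd ', 'a ', 's ', 'e ']
```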
How can I split a string into substrings based on the characters contained in the substrings. For example, given a string "ABC12345..::", I would like to get a list like ['ABC', '12345', '..::']. I know the valid characters for each substring, but I don't know the lengths. So the string could also look like "CC123:....:", in which case I would like to have ['CC', '123', ':....:'] as the result.
By your example you don't seem to have anything to split with (e.g. nothing between C and 1), but what you do have is a well-formed pattern that you can match. So just simply create a pattern that groups the strings you want matched:
>>> import re
>>> s = "ABC12345..::"
>>> re.match(r'([A-Z]*)([0-9]*)([\.:]*)', s).groups()
('ABC', '12345', '..::')
Alternatively, compile the pattern into a reusable regex object and do this:
>>> patt = re.compile(r'([A-Z]*)([0-9]*)([\.:]*)')
>>> patt.match(s).groups()
('ABC', '12345', '..::')
>>> patt.match("CC123:....:").groups()
('CC', '123', ':....:')
Match each group with the following regex
[0-9]+|[a-zA-Z]+|[.:]+
[0-9]+ one or more digits, or
[a-zA-Z]+ one or more letters, or
[.:]+ one or more dots or colons
This will allow you to match groups in any order, ie: "123...xy::ab..98765PQRS".
import re
print(re.findall( r'[0-9]+|[a-zA-Z]+|[.:]+', "ABC12345..::"))
# => ['ABC', '12345', '..::']
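If it also matters which kind of substring each piece is, the alternation can be written with named groups. A minimal sketch; the group names `digits`, `letters` and `punct` and the helper `tokenize` are assumptions made for illustration:

```python
import re

TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[a-zA-Z]+)|(?P<punct>[.:]+)")

def tokenize(text):
    """Return (kind, substring) pairs; characters outside the three classes are skipped."""
    # lastgroup is the name of the alternative that produced the match
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tokenize("ABC12345..::"))
# [('letters', 'ABC'), ('digits', '12345'), ('punct', '..::')]
```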
If you want a non-regex approach:
value = 'ABC12345..::'
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use slicing to build the list
Output:
['ABC', '12345', '..::']
Another string:
value = "CC123:....:"
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use slicing to build the list
Output:
['CC', '123', ':....:']
EDIT:
Just did a benchmark, metatoaster's method is slightly faster than this :)