Extract information with brackets using Python

I have a badly formatted log and need to extract its contents into a dictionary using Python.
# Pattern (the keys are not literally kw1, kw2, etc.; there is no pattern in the keys):
"para1=a, kw2=b, (b, b=b), bb, kw3=c, t4=..."
# where
# - para1=a
# - kw2=b, (b, b=b), bb
# - kw3=c
# - and so on
# extract into a dict:
out = {"para1": "a", "kw2": "b, (b, b=b), bb", "kw3": "c", "t4": ...}
# Note several important features
'''
1. all 'kw=value' pairs are joined with a certain splitter: ', '
2. value and kw themselves may contain the splitter, e.g. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
3. all brackets are paired (therefore we can identify splitters inside a kw or value).
4. every part of the message belongs to some kw or value.
'''
Q1: Is there a regex (or some Python code) that helps me get the above keys and values?
There's no pattern in the keys; kwN is just a placeholder for a key.
Q2 (update: thanks to Laurent, I now understand why this doesn't work): I got an unexpected result. Shouldn't ', (.*?)=' give me the shortest match between ',' and '='?
msg = 'a, a, b=b, c=c'
re.findall(', (.*?)=', msg)
>>> ['a, b', 'c']
# I was expecting ['b','c']
# Shouldn't ', (.*?)=' give me the shortest match between ',' and '=', i.e. 'b' instead of 'a, b'?
(New) Q3: Since I'm working with huge loads of data, efficiency is my first priority. I've worked out Python code that achieves the goal, but it doesn't feel fast enough. Could you help me make it better?
import re

def my_not_efficient_solution(msg):
    '''
    Notes:
    1. all 'kw=value' pairs are joined with a certain splitter: ', '
    2. value and kw themselves may contain the splitter, e.g. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
    3. all brackets are paired (therefore we can identify splitters inside a kw or value).
    4. every part of the message belongs to some kw or value.
    Solution:
    1. split the message with the splitter -> get entries
    2. check each entry for balanced brackets and an equals sign
    3. for each entry: append it to the last one, keep it as part of the next one, or accept it as is
    '''
    spliter = ', '
    eq_sign = ['=']
    first = False
    bracket_map = {'(': 1, ')': -1, '[': 1, ']': -1}
    pair_chk_func = lambda s: not sum(bracket_map.get(i, 0) for i in s)
    eq_chk_func = lambda s: sum(i in s for i in eq_sign)
    assert pair_chk_func(msg), 'msg pair check fail.'
    res = msg.split(spliter)
    # step 1: split into entries
    entries = []
    do_pre = ''    # last entry is not complete (unbalanced brackets)
    do_first = ''  # last entry is not complete (missing equals sign)
    while len(res) > 0:
        if first and len(entries) == 2:
            entries.pop(-1)
            break
        if do_first and len(entries) == 0:
            do = do_first + res.pop(0)
        else:
            do_first = ''
            do = res.pop(0)
        eq_chk = eq_chk_func(do_pre + do)
        pair_chk = pair_chk_func(do_pre + do)
        # case 1: not a valid entry (no equals sign)
        # case 2: previous entry not complete
        # case 3: current entry not valid (no equals sign, will drop) and brackets incomplete (may be part of the next entry)
        if not eq_chk or do_pre:
            if len(entries) > 0:
                entries[-1] += spliter + do
                pair_chk = pair_chk_func(entries[-1])
                if pair_chk:
                    do_pre = ''
                else:
                    do_pre = entries[-1]
            elif not pair_chk:
                do_first = do
        # case 4: current entry good to go
        elif eq_chk and pair_chk:
            entries.append(do)
            do_pre = ''
        # case 5: current entry not complete (brackets unbalanced)
        else:
            entries.append(do)
            do_pre = do
    # step 2: split each entry into the dict
    output = {}
    split_mark = '|'.join(eq_sign)
    for entry in entries:
        splits = re.split(split_mark, entry)
        if len(splits) < 2:
            raise ValueError('split fail for message')
        kw = splits.pop(0)
        while not pair_chk_func(kw):
            kw += '=' + splits.pop(0)
        output[kw] = '='.join(splits)
    return output
msg = 'B_=a, kw2=b, f(A=3, k=2)=g(t=3, v=5), mark[(blabla), f(xx tt)=33]'
my_not_efficient_solution(msg)
>>> {'B_': 'a',
'kw2': 'b',
'f(A=3, k=2)': 'g(t=3, v=5), mark[(blabla), f(xx tt)=33]'}
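Regarding Q3, one direction that might be worth measuring is a single pass that tracks bracket depth, splitting on ', ' only at depth 0 and merging pieces without a top-level '=' into the previous value. This is only a sketch under those assumptions; parse_log and _top_level_eq_index are made-up names, not from this post:
def _top_level_eq_index(s):
    # Index of the first '=' outside any ()/[] pair, or -1 if there is none.
    depth = 0
    for i, c in enumerate(s):
        if c in '([':
            depth += 1
        elif c in ')]':
            depth -= 1
        elif c == '=' and depth == 0:
            return i
    return -1

def parse_log(msg, spliter=', '):
    # Pass 1: cut the message on the spliter, but only at bracket depth 0.
    depth, start, pieces = 0, 0, []
    for i, c in enumerate(msg):
        if c in '([':
            depth += 1
        elif c in ')]':
            depth -= 1
        elif depth == 0 and msg.startswith(spliter, i):
            pieces.append(msg[start:i])
            start = i + len(spliter)
    pieces.append(msg[start:])
    # Pass 2: a piece with no top-level '=' belongs to the previous value.
    entries = []
    for p in pieces:
        if entries and _top_level_eq_index(p) == -1:
            entries[-1] += spliter + p
        else:
            entries.append(p)
    # Pass 3: split each entry on its first top-level '='.
    output = {}
    for e in entries:
        i = _top_level_eq_index(e)
        if i == -1:
            raise ValueError('entry without a top-level "=": %r' % e)
        output[e[:i]] = e[i + 1:]
    return output

msg = 'B_=a, kw2=b, f(A=3, k=2)=g(t=3, v=5), mark[(blabla), f(xx tt)=33]'
print(parse_log(msg))
On the sample message above this should reproduce the same dictionary as my_not_efficient_solution; whether it is actually faster on the real log would need profiling.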

Answer to Q1:
Here is my suggestion:
import re
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
pattern = r'(?=(kw.)=(.*?)(?:, kw.=|$))'
result = dict(re.findall(pattern, s))
print(result) # {'kw1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c', 'kw4': '...'}
To explain the regex:
the (?=...) is a lookahead assertion to let you find overlapping matches
the ? in (.*?) makes the quantifier * (asterisk) non-greedy
the ?: makes the group (?:, kw.=|$) non-capturing
the |$ at the end accounts for the last value in your string
Answer to Q2:
No, that expectation is wrong. A lazy quantifier does not give you the shortest match overall; the match still starts at the leftmost possible position (the first ', '), and .*? then expands only until the first '=' it can reach, which yields 'a, b'. Moreover, there is no search for overlapping matches, which could be done with (?=...). So your observed result is the expected one.
May I suggest this simple solution:
msg = 'a, a, b=b, c=c'
result = re.findall(', ([^,]*?)=', msg)
print(result) # ['b', 'c']
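To see this concretely, here is a small illustrative check (added for clarity, not part of the original answer) that prints where each lazy match actually starts and ends:
import re

msg = 'a, a, b=b, c=c'
for m in re.finditer(r', (.*?)=', msg):
    print(m.span(), repr(m.group()), repr(m.group(1)))
# (1, 8) ', a, b=' 'a, b'   <- the match starts at the first ', ' and .*? grows only until the first '='
# (9, 13) ', c=' 'c'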

Q1: Is there a regex expression that helps me get above key and value?
To get the key/value pairs in dictionary format you can do the following.
Say your string is
"kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"
Using the following regex gives you key and values in a list
results = re.findall(r'(\bkw\d+)=(.*?)(?=,+\s*\bkw\d+=|$)', s)
[('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', 'dd'), ('kw10', 'jndn')]
You can convert it to a dictionary as
dict(results)
Output:
{
    'kw1': 'a',
    'kw2': 'b, (b, b=b), bb',
    'kw3': 'c',
    'kw4': 'dd',
    'kw10': 'jndn'
}
Explanation:
\b is used as a word boundary and will only match kw and not something like XYZkw
kw\d+= Match the word kw followed by 1+ digits and =
.*? (Lazy match) Match as few chars as possible
(?= Positive lookahead, assert to the right
,+\s*\bkw\d+= Match 1+ commas, optional whitespace chars, then kw, 1+ digits and =
| Or
$ Assert the end of the string for the last part
) Close the lookahead

Related

Dynamically replace all starting and ending letters of words in a sentence by using regex plus a dictionary or Hash map in python

I'm looking for a way to create a function that dynamically replaces all the initial or beginning letters of words in a sentence. I created a function that replaces the initial letters no problem.
import re

def replace_all_initial_letters(original, new, sentence):
    new_string = re.sub(r'\b' + original, new, sentence)
    return new_string

test_sentence = 'This was something that had to happen again'
print(replace_all_initial_letters('h', 'b', test_sentence))
Output: 'This was something that bad to bappen again'
I would however like to be able to enter multiple options into this function using a dictionary or Hash Map. For example like using the following:
initialLetterConversion = {
    'r': 'v',
    'h': 'b'
}
Or I think there might be a way to do this using regex grouping perhaps.
I'm also having trouble implementing this for ending letters. I tried the following function but it does not work
def replace_all_final_letters(original, new, sentence):
    new_string = re.sub(original + r'/s', new, sentence)
    return new_string

print(replace_all_final_letters('n', 'm', test_sentence))
Expected Output: 'This was something that had to happem agaim'
Any help would be greatly appreciated.
By "simple" grouping you can access to the match with the lastindex attribute. Notice that such indexes starts from 1. re.sub accept as second argument a callback to add more flexibility for custom substitutions. Here an example of usage:
import re

mapper = [
    {'regex': r'\b(w)', 'replace_with': 'W'},
    {'regex': r'\b(h)', 'replace_with': 'H'},
]
regex = '|'.join(d['regex'] for d in mapper)

def replacer(match):
    return mapper[match.lastindex - 1]['replace_with']  # mapper is globally defined

text = 'This was something that had to happen again'
out = re.sub(regex, replacer, text)
print(out)
# This Was something that Had to Happen again
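If you would rather drive the substitution directly from a dictionary, as the question suggests, here is a minimal sketch along the same lines (only initialLetterConversion and the sentence come from the question; the rest is an assumption):
import re

initialLetterConversion = {'r': 'v', 'h': 'b'}
# One alternation of the (escaped) keys, anchored at the start of a word.
pattern = r'\b(' + '|'.join(map(re.escape, initialLetterConversion)) + r')'
text = 'This was something that had to happen again'
print(re.sub(pattern, lambda m: initialLetterConversion[m.group(1)], text))
# This was something that bad to bappen again
# For ending letters, an analogous pattern ending in r')\b' with its own mapping should work the same way.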
Ignore this if, for some reason, re is required for this. This is plain Python with no need for any imports.
The conversion map is a list of 2-tuples. Each tuple has a from and to value. The from and to values are not limited to a string of length 1.
This single function handles both the beginning and end of words, although each mapping is applied to both 'ends' of a word and may therefore need some adaptation.
sentence = 'This was something that had to happen again'

def func(sentence, conv_map):
    words = sentence.split()
    for i, word in enumerate(words):
        for f, t in conv_map:
            if word.startswith(f):
                words[i] = t + word[len(f):]
                word = words[i]
            if word.endswith(f):
                words[i] = word[:-len(f)] + t
    return ' '.join(words)

print(func(sentence, [('h', 'b'), ('a', 'x'), ('s', 'y')]))
Output:
Thiy way yomething that bad to bappen xgain

Check for exact number of consecutive repetitions with a regex

With regex only, how can I match an exact number of consecutive repetitions of an arbitrary single token? For example, matching the "aaa" in "ttaaabbb" but not the "aaaa" in "ttaaaabbb", given the desired number of repetitions is 3.
Clarification: Note I was using "a" for an example, the token can be arbitrary character/number/symbols. That is, given the desired number of repetitions is 3, the desired match of "aaaa**!!!cccc333**" only gives "!!!" and "333".
In short, I want to find a list of tokens "X" where YXXXY appeared in the given string (Y is some other tokens that are different from X, Y can also be the start of the string or the end of the string). Note there can be repeated tokens in the list, e.g., "aaabbbbaaa" should give ["a", "a"].
Some other examples:
Input: "aaabbbbbbaaa****ccc", output: ["a", "a", "c"]
Input: "!!! aaaabbbaaa ccc!!!", output: ["!", "b", "a", "c", "!"].
What I have tried: I tried (.)\1{2}, but unfortunately it also matches inside the longer runs "aaaa" and "cccc" in the example above. I further changed it to (?!\1)(.)\1{2}(?!\1) so that the prefix and suffix of the repeating pattern differ from it. However, that raises an error, since the first \1 refers to a group that has not been defined yet at that point.
You might use a pattern with 2 capture groups and a repeated backreference.
First match the runs you want to avoid (the same character repeated 4 or more times), then match the same character repeated exactly 3 times.
The single characters that you want are in capture group 2, which you can get using re.finditer for example.
(\S)\1{3,}|(\S)\2{2}
The pattern matches:
(\S)\1{3,} Capture group 1, match a non whitespace char and repeat the backreference 3 or more times
| Or
(\S)\2{2} Capture group 2, match a non whitespace char and repeat the backreference 2 times
For example:
import re

strings = [
    "aaaa**!!!cccc333**",
    "aaabbbbaaa",
    "aaabbbbbbaaa****ccc",
    "!!! aaaabbbaaa ccc!!!",
]
pattern = r"(\S)\1{3,}|(\S)\2{2}"

for s in strings:
    matches = re.finditer(pattern, s)
    result = []
    for matchNum, match in enumerate(matches, start=1):
        if match.group(2):
            result.append(match.group(2))
    print(result)
Output
['!', '3']
['a', 'a']
['a', 'a', 'c']
['!', 'b', 'a', 'c', '!']
You can do something like this using a regex and a loop:
import re

def exact_re_match(string, length):
    regex = re.compile(r'(.)\1*')
    for match in regex.finditer(string):
        elm = match.group()
        if len(elm) == length:
            yield elm

string = "aaaa!!!cccc333"
out = list(exact_re_match(string, 3))
print(out)
# ['!!!', '333']

Python Combining f-string with r-string and curly braces in regex

Given a single word (x), return the possible n-grams that can be found in that word.
You can modify the n-gram value as you want;
it is in the curly braces in the pat variable.
The default n-gram value is 4.
For example; for the word (x):
x = 'abcdef'
The possible 4-gram are:
['abcd', 'bcde', 'cdef']
import re

def ngram_finder(x):
    pat = r'(?=(\S{4}))'
    xx = re.findall(pat, x)
    return xx
The question is:
How do I combine the f-string with the r-string in the regex, using curly braces?
You can use this string to combine the n value into your regexp, using double curly brackets to create a single one in the output:
fr'(?=(\S{{{n}}}))'
The regex needs to have {} to make a quantifier (as you had in your original regex {4}). However, f-strings use {} to indicate an expression replacement, so you need to "escape" the {} required by the regex in the f-string. That is done by using {{ and }}, which in the output create { and }. So {{{n}}} (where n=4) generates '{' + '4' + '}' = '{4}' as required.
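As a quick sanity check of the brace escaping (the value of n here is just illustrative):
n = 4
print(fr'(?=(\S{{{n}}}))')  # prints (?=(\S{4}))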
Complete code:
import re

def ngram_finder(x, n):
    pat = fr'(?=(\S{{{n}}}))'
    return re.findall(pat, x)

x = 'abcdef'
print(ngram_finder(x, 4))
print(ngram_finder(x, 5))
Output:
['abcd', 'bcde', 'cdef']
['abcde', 'bcdef']

Splitting on every character except for a preserved substring

Given the string
word = "These"
that contains the tuple
pair = ("h", "e")
the aim is to split the word on every character except for the pair tuple, i.e. the output should be:
('T', 'he', 's', 'e')
I've tried:
import re

word = 'These'
pair = ('h', 'e')
first, second = pair
pair_str = ''.join(pair)
pair_str = pair_str.replace('\\', '\\\\')
pattern = re.compile(r'(?<!\S)' + re.escape(first + ' ' + second) + r'(?!\S)')
new_word = ' '.join(word)
new_word = pattern.sub(pair_str, new_word)
result = tuple(new_word.split())
Note that sometimes the pair tuple can contain slashes, backslashes or any other escape characters, thus the replace and escape in the above regex.
Is there a simpler way to achieve the same string replacement?
EDITED
Specifics from comments:
And is there a distinction between when both characters in the pair are unique and when they aren't?
Nope, they should be treated the same way.
Match instead of splitting:
pattern = re.escape(''.join(pair)) + '|.'
result = tuple(re.findall(pattern, word))
The pattern is <pair>|., which matches the pair if possible and a single character* otherwise.
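For instance, with word and pair as in the question, a quick illustrative run:
import re

word = 'These'
pair = ('h', 'e')
pattern = re.escape(''.join(pair)) + '|.'
print(tuple(re.findall(pattern, word)))
# ('T', 'he', 's', 'e')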
You can also do this without regular expressions:
import itertools
non_pairs = word.split(''.join(pair))
result = [(''.join(pair),)] * (2 * len(non_pairs) - 1)
result[::2] = non_pairs
result = tuple(itertools.chain(*result))
* It doesn’t match newlines, though; if you have those, pass re.DOTALL as a third argument to re.findall.
You can do it without using regular expressions:
import functools
word = 'These here when she'
pair = ('h', 'e')
digram = ''.join(pair)
parts = map(list, word.split(digram))
lex = lambda pre,post: post if pre is None else pre+[digram]+post
print(functools.reduce(lex, parts, None))
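For the sample sentence above, this should print the flat token list:
# ['T', 'he', 's', 'e', ' ', 'he', 'r', 'e', ' ', 'w', 'he', 'n', ' ', 's', 'he']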

How to use * or + with brackets in regular expressions in Python?

There are multiple space-separated characters in the input, e.g. string = "a b c d a s e "
What should the pattern be such that, when I do re.search on the input using the pattern, I get the j-th character along with the space following it by using .group(j)?
I tried something of the sort "^(([a-zA-Z])\s)+" but this is not working. What should I do?
EDIT
My actual question is in the heading and the body described only a special case of it:
Here's the general version of the question: if I have to extract all matches of a pattern of a specific type (the initial question had the pattern "[a-zA-Z]\s") from a string, what should I do?
Use findall() instead and get the j-th match by index:
>>> j = 2
>>> re.findall(r"[a-zA-Z]\s", string)[j]
'c '
where [a-zA-Z]\s would match a lower or upper case letter followed by a single space character.
Why use regex when you can simply use the str.split() method and access the characters with simple indexing?
>>> new = string.split()
>>> new
['a', 'b', 'c', 'd', 'a', 's', 'e']
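With j = 2, as in the previous answer, the j-th character is then just an index away:
>>> new[2]
'c'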
You could do:
>>> string = "a b c d a s e "
>>> j=2
>>> re.search(r'([a-zA-Z]\s){%i}' % j, string).group(1)
'b '
Explanation:
With the pattern ([a-zA-Z]\s) you capture a letter then the space;
With the repetition {2} added, you capture the last of the repetition -- in this case the second one (base 1 vs base 0 indexing...).
