Python – Debugging and visualizing a regex [duplicate]

When doing a regex pattern match, we get the text that matched. What if I instead want the keyword from my pattern that produced the match?
See the example below:
>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']
but I want the output to look like this: ['ERP', 'Gap', 'ERP', 'ERP']
because if I do a group-by and sum on the original output, I get the following dataframe:
ERP 1
erp 1
ErP 1
GAP 1
gap 1
But what if I want the output to look like
ERP 3
Gap 2
in line with the keywords I am searching for?
MORE CONTEXT
I have a keyword list like this: ['ERP', 'Gap'], and a string like this: "ERP, erp, ErP, GAP, gap".
I want to count the number of times each keyword appears in the string. If I do a pattern match, I get the following output: ['ERP', 'erp', 'ErP', 'GAP', 'gap'].
Now if I want to aggregate and take a count, I am getting the following dataframe:
ERP 1
erp 1
ErP 1
GAP 1
gap 1
While I want the output to look like this:
ERP 3
Gap 2

You may build the pattern dynamically, encoding the index of each search word in a named group, and then grab the word whose group matched:
import re

words = ["ERP", "Gap"]
words_dict = {f'g{i}': item for i, item in enumerate(words)}
rx = rf"\b(?:{'|'.join([rf'(?P<g{i}>{item})' for i, item in enumerate(words)])})\b"
text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    results.append([words_dict.get(key) for key, value in match.groupdict().items() if value][0])
print(results)  # => ['ERP', 'Gap', 'ERP', 'ERP']
See the Python demo online
The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:
\b - a word boundary
(?: - start of a non-capturing group encapsulating pattern parts:
(?P<g0>ERP) - Group "g0": ERP
| - or
(?P<g1>Gap) - Group "g1": Gap
) - end of the group
\b - a word boundary.
See the regex demo.
Note that the [0] in [words_dict.get(key) for key, value in match.groupdict().items() if value][0] is safe in all cases, because whenever there is a match exactly one named group has participated in it.
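If you then want the per-keyword counts, a minimal follow-up sketch with collections.Counter (assuming the results list built by the snippet above):
from collections import Counter

results = ['ERP', 'Gap', 'ERP', 'ERP']  # as produced by the snippet above
for keyword, count in Counter(results).items():
    print(keyword, count)
# ERP 3
# Gap 1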

Refer to the comments above.
Try:
>>> [x.upper() for x in r.findall(string)]
['ERP', 'GAP', 'ERP', 'ERP']
>>>
OR
>>> list(map(lambda x: x.upper(), r.findall(string)))
['ERP', 'GAP', 'ERP', 'ERP']
>>>
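Note that .upper() gives 'GAP' rather than the 'Gap' spelling from the keyword list. A minimal sketch, assuming the keyword list ['ERP', 'Gap'] from the question, that maps each match back to the keyword's original casing:
import re

keywords = ['ERP', 'Gap']
lookup = {k.lower(): k for k in keywords}  # 'erp' -> 'ERP', 'gap' -> 'Gap'
r = re.compile('|'.join(map(re.escape, keywords)), re.I)
string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
print([lookup[m.lower()] for m in r.findall(string)])
# ['ERP', 'Gap', 'ERP', 'ERP']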

Related

Split string and replace Discord emoji to [name]

I have incoming messages such as <a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes: and I want the output to be [GG] [1Copy][14][eyes]Hello friend![eyes]
The code below is what I currently have, and it sort of works. The incoming example above outputs [GG] [1Copy] [14] [eyes]
import re

def shorten_emojis(content):
    seperators = ("<a:", "<:")
    output = []
    for chunk in content.split():
        if any(match in chunk for match in seperators):
            parsed_chunk = []
            # Put ";" around each <...> token so it can be split out on its own.
            new_chunk = chunk.replace("<", ";<").replace(">", ">;")
            for emo in new_chunk.split(";"):
                if emo.startswith(seperators):
                    # Keep only the emoji name from <a:name:id> / <:name:id>.
                    emo = f"<{splits[1]}>" if len(splits := emo.split(":")) == 3 else emo
                parsed_chunk.append(emo)
            chunk = "".join(parsed_chunk)
        output.append(chunk)
    output = " ".join(output)
    # Rewrite the bare :name: emoji found in the original content.
    for e in re.findall(":.+?:", content):
        output = output.replace(e, f"<{e.replace(':', '')}>")
    return output
Test #1
Input: <a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:
Output: [GG] [1Copy] [14] :eyes:Hello friend!:eyes:
Desired [GG] [1Copy][14][eyes]Hello friend![eyes]
Test #2
Input: <a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:
Output: [cryLaptop] [1Copy] [14] [thoonk] [coolbutdepressed] [KL1Heart] Nice [dogwonder] OK:eyes:
Desired [cryLaptop] [GG] [1Copy] [14] [thoonk] [coolbutdepressed] [KL1Heart] Nice [dogwonder] OK[eyes]
Edit
I have edited my code block, it now works as desired.
You might use a single pattern with an alternation | to match both variations. Then in the callback of sub, you can check for the existence of group 1.
<a?:([^:<>]+)[^<>]*>|:([^:]+):
The pattern matches
<a?: Match <, optional a and :
([^:<>]+) Capture in group 1 any char except : < and >
[^<>]*> Optionally match any char except < and >, then match >
| Or
:([^:]+): Capture in group 2 all between :
See a regex demo and a Python demo.
For example
import re

pattern = r"<a?:([^:<>]+)[^<>]*>|:([^:]+):"

def shorten_emojis(content):
    return re.sub(
        pattern,
        lambda x: f"[{x.group(1)}]" if x.group(1) else f"[{x.group(2)}]",
        content
    )

print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))
print(shorten_emojis("<a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:"))
Output
[GG] [1Copy][14][eyes]Hello friend![eyes]
[cryLaptop][1Copy][14] [thoonk][coolbutdepressed][KL1Heart]Nice[dogwonder] OK[eyes]
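If the lambda becomes hard to read, the same callback could also be written as a named function; a small sketch of that variant:
import re

pattern = r"<a?:([^:<>]+)[^<>]*>|:([^:]+):"

def bracket_name(match):
    # Exactly one of the two groups participates in any given match.
    name = match.group(1) or match.group(2)
    return f"[{name}]"

def shorten_emojis(content):
    return re.sub(pattern, bracket_name, content)

print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))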
You can do this with regular expressions using the re module, which is included in Python's standard library.
I have modified the code a bit to make it more compact, but I think it reads the same.
The most important thing is to detect the three kinds of tokens: (<.*?>) selects the <word> parts, (:.*?:) selects the :word: parts, and (.*?) matches the rest of the text.
Then we format each part with the expected values and output them.
import re

def shorten_emojis(content):
    tags = re.findall('((<.*?>)|(:.*?:)||(.*?))', content)
    output = ""
    for tag in tags:
        if re.findall("<.*?>", tag[0]):
            valor = re.search(':.*?:', tag[0])
            output += f"[{valor.group()[1:-1]}]"
        elif re.match(":.*?:", tag[0]):
            output += f"[{tag[0][1:-1]}]"
        else:
            output += f"{tag[0]}"
    return output

print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))
print(shorten_emojis("<a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:"))
RESULT:
[GG] [1Copy][14][eyes]Hello friend![eyes]
[cryLaptop][1Copy][14] [thoonk][coolbutdepressed][KL1Heart]Nice[dogwonder] OK[eyes]

Python Conditional Split

Given this string:
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
I want to split it on each new record (which starts with a date) like this:
['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']
Notice the extra new line delimiter between ABC and DEF? That's the challenge I'm having. I want to preserve it without a split there.
I'm thinking I need to conditionally split on any delimiter of these:
['01/', '02/','03/', '04/', '05/', '06/', '07/', '08/', '09/', '10/', '11/', '12/']
Is there an easy way to use re.findall this way or is there a better approach?
Thanks in advance!
You could split on the new line that is followed by a date with a lookahead. Something like:
import re
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
re.split(r'\n(?=\d{2}/\d{2}/\d{4})', s)
# ['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']
You may be able to simplify to just a newline followed by 2 digits depending on your data: r'\n(?=\d{2})'
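For instance, a quick sketch of that simplification (it assumes no continuation line starts with two digits):
import re

s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
# Split only on newlines that are immediately followed by two digits.
print(re.split(r'\n(?=\d{2})', s))
# ['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']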
Use regex instead.
code
import re
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
chunks = re.compile(r'[\n](?=\d\d/\d\d/\d\d\d\d)').split(s)
print(chunks)
output
['01/03/1988 U/9 Mi', '08/19/1966 ABC\nDEF', '12/31/1999 YTD ABC']
You can also match a more specific date-like format without lookarounds.
^(?:0[1-9]|1[012])/(?:0[1-9]|[12]\d|3[01])/(?:19|20)\d\d\b.*$
^ Start of line (with re.MULTILINE)
(?:0[1-9]|1[012]) Match a month number from 01 - 12
/ Match literally
(?:0[1-9]|[12]\d|3[01]) Match a day number from 01 - 31
/ Match literally
(?:19|20)\d\d Match either 19 or 20 and 2 digits (or just 4 digits with \d{4})
\b.* A word boundary, then match the rest of the line
$ End of line (with re.MULTILINE)
Regex demo | Python demo
Example code
import re
s = '01/03/1988 U/9 Mi\n08/19/1966 ABC\nDEF\n12/31/1999 YTD ABC'
regex = r'^(?:0[1-9]|1[012])/(?:0[1-9]|[12]\d|3[01])/(?:19|20)\d\d\b.*$'
print(re.findall(regex, s, re.MULTILINE))
Output
['01/03/1988 U/9 Mi', '08/19/1966 ABC', '12/31/1999 YTD ABC']

Find matching similar keywords in Python Dataframe

joined_Gravity1.head()
Comments
____________________________________________________
0 Why the old Pike/Lyrik?
1 This is good
2 So clean
3 Looks like a Decoy
Input: type(joined_Gravity1)
Output: pandas.core.frame.DataFrame
The following code allows me to select strings that contain the keyword "ender":
joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]
Output:
Comments
___________________________
194 We need a new Sender 😂
7 What about the sender
179 what about the sender?😏
How do I revise the code to also include words similar to 'Sender', such as 'snder' and 'bnder'?
I don't see a reason why regex=True inside the contains function won't work here.
joined_Gravity1[joined_Gravity1["Comments"].str.contains(pat="ender|snder|bnder", na=False, regex=True)]
I have used "ender|snder|bnder" only. You can make a list of all such words, say list_words, and pass pat='|'.join(list_words) to contains, as sketched below.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
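A minimal sketch of that idea (list_words and the sample frame below are just examples):
import pandas as pd

list_words = ["ender", "snder", "bnder"]  # hypothetical keyword list
pat = "|".join(list_words)

df = pd.DataFrame({"Comments": ["We need a new Sender", "what about the snder?", "So clean"]})  # sample frame
print(df[df["Comments"].str.contains(pat=pat, na=False, regex=True, case=False)])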
There can be a massive number of possible letter combinations in such words. What you are trying to do is a fuzzy match between two strings. I can recommend the following:
#!pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
word = 'sender'
others = ['bnder', 'snder', 'sender', 'hello']
process.extractBests(word, others)
[('sender', 100), ('snder', 91), ('bnder', 73), ('hello', 18)]
Based on this you can decide which threshold to choose and then mark the ones that are above the threshold as a match (using the code you used above)
Here is a method to do this for your exact problem statement with a function:
import pandas as pd

df = pd.DataFrame(['hi there i am a sender',
                   'I dont wanna be a bnder',
                   'can i be the snder?',
                   'i think i am a nerd'], columns=['text'])

# s = sentence, w = match word, t = match threshold
def get_match(s, w, t):
    ss = process.extractBests(w, s.split())
    return any([i[1] > t for i in ss])

# What it's doing: match each word in each row of df.text against
# the word "sender" and see if any word has a match ratio greater
# than the threshold of 70.
df['match'] = df['text'].apply(get_match, w='sender', t=70)
print(df)
text match
0 hi there i am a sender True
1 I dont wanna be a bnder True
2 can i be the snder? True
3 i think i am a nerd False
Tweak the t value from 70 to 80 if you want a more exact match, or lower it for a more relaxed match.
Finally you can filter it out -
df[df['match']==True][['text']]
text
0 hi there i am a sender
1 I dont wanna be a bnder
2 can i be the snder?
from difflib import get_close_matches

def closeMatches(patterns, word):
    print(get_close_matches(word, patterns))

list_patterns = joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]
word = 'Sender'
patterns = list_patterns
closeMatches(patterns, word)
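Note that get_close_matches expects a list of strings, so a hedged sketch of applying it per comment to filter the frame might look like this (has_close_match and the sample frame are made up for illustration):
import pandas as pd
from difflib import get_close_matches

df = pd.DataFrame({"Comments": ["We need a new Sender", "what about the snder?", "So clean"]})  # sample frame

def has_close_match(comment, word="sender", cutoff=0.75):
    # Compare the target word against every word of the comment.
    return bool(get_close_matches(word, str(comment).lower().split(), cutoff=cutoff))

print(df[df["Comments"].apply(has_close_match)])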

Repeated regex groups of arbitrary number

I have this example text snippet
headline:
Status[apphmi]: blubb, 'Statustext1'
Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'
Popup[apphmi]: blaaa, 'Popuptext1'
and I want to extract the words inside the single quotes, grouped by their context (Status, Main, Popup).
My current regex is (example at pythex.org):
headline:(?:\n +Status\[apphmi\]:.* '(.*)')*(?:\n +Main\[apphmi\]:.* '(.*)')*(?:\n +Popup\[apphmi\]:.* '(.*)')*
but with this I only get 'Maintext2' and not both Main texts. I don't know how to repeat the groups an arbitrary number of times.
You can try with this:
r"(.*?]):(?:[^']*)'([^']*)'"
Group 1 and group 2 of each match contain your key-value pair.
You cannot merge the duplicate keys with the regex alone; once you have all the pairs, you can merge duplicate keys with a bit of code.
Here I have used a dictionary of lists: if a key already exists in the dictionary, append the value to its list; otherwise insert a new key with a new list containing the value.
This is how it can be done (tested in Python 3):
import re

d = dict()
regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
            "Status[apphmi]: blubb, 'Statustext1'\n"
            "Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
            "Popup[apphmi]: blaaa, 'Popuptext1'")

matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    if match.group(1) in d:
        d[match.group(1)].append(match.group(2))
    else:
        d[match.group(1)] = [match.group(2), ]
print(d)
Output:
{
'Popup[apphmi]': ['Popuptext1'],
'Main[apphmi]': ['Maintext1', 'Maintext2'],
'Status[apphmi]': ['Statustext1']
}
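The same grouping could be written a bit more compactly with collections.defaultdict; a minimal equivalent sketch:
import re
from collections import defaultdict

regex = r"(.*?]):(?:[^']*)'([^']*)'"
test_str = ("headline: \n"
            "Status[apphmi]: blubb, 'Statustext1'\n"
            "Main[apphmi]: bla, 'Maintext1'Main[apphmi]: blaa, 'Maintext2'\n"
            "Popup[apphmi]: blaaa, 'Popuptext1'")

d = defaultdict(list)
for match in re.finditer(regex, test_str):
    d[match.group(1)].append(match.group(2))
print(dict(d))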

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print(re.findall(pattern, title))
A "normal" pattern can't match overlapping substrings, all characters are founded once for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and match nothing. It can parse the string several times. Thus if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the captured words, testing each with the regex (^[A-Z][a-z.-]+$) to ensure that both the current word and the next one are properly capitalized.
Working example:
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
if m:
    for i in range(len(m)):
        if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
            matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Browser', 'Announcing'],
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]
If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every other pair. An easy solution would be to re-search the pattern after skipping the first word, like this:
m = re.match("[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2 together.
