I have incoming messages such as <a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes: and I want this output to be [GG] [1Copy][14][eyes]Hello friend![eyes]
The code below is what I currently have, and it only partly works. For the incoming example above it outputs [GG] [1Copy] [14] [eyes]
import re
def shorten_emojis(content):
    seperators = ("<a:", "<:")
    output = []
    for chunk in content.split():
        if any(match in chunk for match in seperators):
            parsed_chunk = []
            new_chunk = chunk.replace("<", ";<").replace(">", ">;")
            for emo in new_chunk.split(";"):
                if emo.startswith(seperators):
                    emo = f"<{splits[1]}>" if len(splits := emo.split(":")) == 3 else emo
                parsed_chunk.append(emo)
            chunk = "".join(parsed_chunk)
        output.append(chunk)
    output = " ".join(output)
    for e in re.findall(":.+?:", content):
        output = output.replace(e, f"<{e.replace(':', '')}>")
    return output
Test #1
Input: <a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:
Output: [GG] [1Copy] [14] :eyes:Hello friend!:eyes:
Desired [GG] [1Copy][14][eyes]Hello friend![eyes]
Test #2
Input: <a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:
Output: [cryLaptop] [1Copy] [14] [thoonk] [coolbutdepressed] [KL1Heart] Nice [dogwonder] OK:eyes:
Desired [cryLaptop] [GG] [1Copy] [14] [thoonk] [coolbutdepressed] [KL1Heart] Nice [dogwonder] OK[eyes]
Edit
I have edited my code block, it now works as desired.
You might use a single pattern with an alternation | to match both variations. Then in the callback of sub, you can check for the existence of group 1.
<a?:([^:<>]+)[^<>]*>|:([^:]+):
The pattern matches
<a?: Match <, optional a and :
([^:<>]+) Capture in group 1 any char except : < and >
[^<>]*> Optionally match any char except < and >, then match >
| Or
:([^:]+): Capture in group 2 all between :
For example
import re
pattern = r"<a?:([^:<>]+)[^<>]*>|:([^:]+):"
def shorten_emojis(content):
    return re.sub(
        pattern,
        lambda x: f"[{x.group(1)}]" if x.group(1) else f"[{x.group(2)}]",
        content,
    )
print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))
print(shorten_emojis("<a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:"))
Output
[GG] [1Copy][14][eyes]Hello friend![eyes]
[cryLaptop][1Copy][14] [thoonk][coolbutdepressed][KL1Heart]Nice[dogwonder] OK[eyes]
You can do this with regular expressions. The re module is already included with Python itself.
I have modified the code a bit to make it more compact, but I think it reads the same.
The most important thing is to detect the three groups of words. With (<.*?>) we select the <words>, with (:.*?:) the :word: and with (.*?) the rest of the text.
Then we must format it with the expected values and display them.
import re
def shorten_emojis(content):
    tags = re.findall('((<.*?>)|(:.*?:)||(.*?))', content)
    output=""
    for tag in tags:
        if re.findall("<.*?>", tag[0]):
            valor=re.search(':.*?:', tag[0])
            output+=f"[{valor.group()[1:-1]}]"
        elif re.match(":.*?:", tag[0]):
            output+=f"[{tag[0][1:-1]}]"
        else:
            output+=f"{tag[0]}"
    return output
print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))
print(shorten_emojis("<a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:"))
RESULT:
[GG] [1Copy][14][eyes]Hello friend![eyes]
[cryLaptop][1Copy][14] [thoonk][coolbutdepressed][KL1Heart]Nice[dogwonder] OK[eyes]
Related
Can anyone help me fix the issue here?
I am trying to extract GSTIN/UIN from texts.
#None of these works
#GSTIN_REG = re.compile(r'^\d{2}([a-z?A-Z?0-9]){5}([a-z?A-Z?0-9]){4}([a-z?A-Z?0-9]){1}?[Z]{1}[A-Z\d]{1}$')
#GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}Z{1}[A-Z0-9]{1}')
#GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}[Z]{1}[A-Z0-9]{1}$')
GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$')
#GSTIN_REG = re.compile(r'19AISPJ4698P1ZX') #This works
#GSTIN_REG = re.compile(r'06AACCE2308Q1ZK') #This works
def extract_gstin(text):
    return re.findall(GSTIN_REG, text)
text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(text))
Your second pattern in the commented out part works, and you can omit {1} as it is the default.
What you might do to make it a bit more specific is add word boundaries \b to the left and right to prevent a partial word match.
If it should be after GSTIN : you can use a capture group as well.
Example with the commented pattern:
import re
GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')
def extract_gstin(s):
    return re.findall(GSTIN_REG, s)
s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(s))
Output
['06AACCE2308Q1ZK']
A bit more specific pattern (which gives the same output, as re.findall returns the value of the capture group):
\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b
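To see why the anchored pattern from the question returns an empty list while the capture-group version finds the number, here is a small sketch. The names ANCHORED, LABELLED and extract_gstin_after_label are made up for this illustration; the patterns are the ones already shown in the question and in this answer.

```python
import re

# Pattern from the question: ^ and $ force the WHOLE string to be a GSTIN,
# so it cannot match a number embedded in a longer text.
ANCHORED = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$')

# Pattern from this answer: the number must follow "GSTIN : " and end at a word boundary.
LABELLED = re.compile(r'\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b')

def extract_gstin_after_label(s):
    # With exactly one capture group, findall returns that group's text,
    # not the whole match (so the "GSTIN : " prefix is dropped).
    return LABELLED.findall(s)

text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(ANCHORED.findall(text))               # []
print(ANCHORED.findall('06AACCE2308Q1ZK'))  # ['06AACCE2308Q1ZK']
print(extract_gstin_after_label(text))      # ['06AACCE2308Q1ZK']
```

So the anchored variants are only useful for validating a string that contains nothing but the number.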
While doing a regex pattern match, we get the content which has been a match. What if I want the pattern which was found in the content?
See the below example:
>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']
but I want the output to look like this : ['ERP', 'Gap', 'ERP', 'ERP']
Because if I do a group by and sum on the original output, I would get the following output as a dataframe:
ERP 1
erp 1
ErP 1
GAP 1
gap 1
But what if I want the output to look like
ERP 3
Gap 2
in par with the keywords I am searching for?
MORE CONTEXT
I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"
I want to take count of number of times each keyword has appeared in the string. Now if I am doing a pattern matching, I am getting the following output: [ERP, erp, ErP, GAP, gap].
Now if I want to aggregate and take a count, I am getting the following dataframe:
ERP 1
erp 1
ErP 1
GAP 1
gap 1
While I want the output to look like this:
ERP 3
Gap 2
You may build the pattern dynamically to include indices of the words you search for in the group names and then grab those pattern parts that matched:
import re
words = ["ERP", "Gap"]
words_dict = { f'g{i}':item for i,item in enumerate(words) }
rx = rf"\b(?:{'|'.join([ rf'(?P<g{i}>{item})' for i,item in enumerate(words) ])})\b"
text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    results.append( [words_dict.get(key) for key,value in match.groupdict().items() if value][0] )
print(results) # => ['ERP', 'Gap', 'ERP', 'ERP']
The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:
\b - a word boundary
(?: - start of a non-capturing group encapsulating pattern parts:
(?P<g0>ERP) - Group "g0": ERP
| - or
(?P<g1>Gap) - Group "g1": Gap
) - end of the group
\b - a word boundary.
Note that taking [0] from [words_dict.get(key) for key,value in match.groupdict().items() if value] works in all cases because, whenever there is a match, exactly one of the named groups has matched.
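If the end goal is the count per keyword, the same named-group idea can feed a collections.Counter directly. This is only a sketch: the function name count_keywords is invented here, re.escape is added on the assumption that keywords should be treated as literal text, and match.lastgroup is used to read the name of the group that matched instead of scanning groupdict().

```python
import re
from collections import Counter

def count_keywords(keywords, text):
    # Map a generated group name (g0, g1, ...) to the keyword as originally written.
    names = {f"g{i}": kw for i, kw in enumerate(keywords)}
    # One named group per keyword; re.escape keeps regex metacharacters literal.
    alternatives = "|".join(
        f"(?P<{name}>{re.escape(kw)})" for name, kw in names.items()
    )
    rx = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    # The named groups are the only capturing groups, so lastgroup is the one that matched.
    return Counter(names[m.lastgroup] for m in rx.finditer(text))

counts = count_keywords(["ERP", "Gap"], "ERP, erp, ErP, GAP, gap")
print(counts["ERP"], counts["Gap"])  # 3 2
```

The resulting Counter can be turned into a dataframe if needed, for example from counts.items().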
Refer comments above.
Try:
>>> [x.upper() for x in r.findall(string)]
['ERP', 'GAP', 'ERP', 'ERP']
>>>
OR
>>> list(map(lambda x: x.upper(), r.findall(string)))
['ERP', 'GAP', 'ERP', 'ERP']
>>>
I'm learning regular expressions, specifically named capture groups.
I'm having an issue where I can't figure out how to write an if/else statement for my function findVul().
Basically how the code works or should work is that findVul() goes through data1 and data2, which have been added to the list myDATA.
If the regex finds a match for the entire named group, then it should print out the results. It currently works perfectly.
CODE:
import re
data1 = '''
dwadawa231d .2 vulnerabilities discovered dasdfadfad .One vulnerability discovered 123e2121d21 .12 vulnerabilities discovered sgwegew342 dawdwadasf
2r3232r32ee
'''
data2 = ''' d21d21 .2 vul discovered adqdwdawd .One vulnerability disc d12d21d .two vulnerabilities discovered 2e1e21d1d f21f21
'''
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.finditer(pattern, data)
    for x in match:
        print(x.group())
myDATA = [data1,data2]
count_data = 1
for x in myDATA:
    print('\n--->Reading data{0}\n'.format(count_data))
    count_data+=1
    findVul(x)
OUTPUT:
--->Reading data1
2 vulnerabilities discovered
One vulnerability discovered
12 vulnerabilities discovered
--->Reading data2
Now I want to add an if/else statement to check if there are any matches for the entire named group.
I tried something like this, but it doesn't seem to be working.
CODE:
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.finditer(pattern, data)
    if len(list(match)) != 0:
        print('\nVulnerabilities Found!\n')
        for x in match:
            print(x.group())
    else:
        print('No Vulnerabilities Found!\n')
OUTPUT:
--->Reading data1
Vulnerabilities Found!
--->Reading data2
No Vulnerabilities Found!
As you can see it does not print the vulnerabilities that should be in data1.
Could someone please explain the correct way to do this and why my logic is wrong.
Thanks so much :) !!
The problem is that re.finditer() returns an iterator that is evaluated when you do the len(list(match)) != 0 test; when you iterate over it again in the for-loop, it is already exhausted and there are no items left. The simple fix is just to add a match = list(match) line after the finditer() call.
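A minimal sketch of that fix, assuming the same pattern as in the question. The name find_vul and the returned list are additions for illustration only, so the result can be checked without reading the printed output.

```python
import re

PATTERN = re.compile(
    r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)'
)

def find_vul(data):
    # Materialise the iterator once, so it can be tested AND looped over.
    matches = list(PATTERN.finditer(data))
    if matches:
        print('\nVulnerabilities Found!\n')
        for m in matches:
            print(m.group('VUL'))
    else:
        print('No Vulnerabilities Found!\n')
    return [m.group('VUL') for m in matches]

found = find_vul('x .2 vulnerabilities discovered y .One vulnerability discovered z')
missing = find_vul('d21d21 .2 vul discovered adqdwdawd .One vulnerability disc')
```

An empty list is falsy, so `if matches:` is enough and len() is not needed.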
I did some more research after @AdamKG's response.
I wanted to utilize the re.findall() function.
re.findall() will return a list of all matched substrings. In my case I have capture groups inside of my named capture group. This will return a list with tuples.
For example the following regex with data1:
pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
match = re.findall(pattern, data)
Will return a list with tuples:
[('2 vulnerabilities discovered', '2', 'vulnerabilities'), ('One vulnerability discovered', 'One', 'vulnerability'), ('12 vulnerabilities discovered', '12', 'vulnerabilities')]
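If only the full text of each match is wanted, one option (not what the original code does) is to make the inner groups non-capturing with (?:...). With a single capture group left, re.findall() returns plain strings instead of tuples. The sample string below is a shortened stand-in for data1.

```python
import re

# Inner groups are non-capturing (?:...), so VUL is the only capture group.
pattern = re.compile(
    r'(?P<VUL>(?:\d{1,2}|One)\s+(?:vulnerabilities|vulnerability)\s+discovered)'
)

sample = 'aaa .2 vulnerabilities discovered bbb .One vulnerability discovered ccc'
matches = pattern.findall(sample)
print(matches)  # ['2 vulnerabilities discovered', 'One vulnerability discovered']
```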
My final code for findVul():
def findVul(data):
    pattern = re.compile(r'(?P<VUL>(\d{1,2}|One)\s+(vulnerabilities|vulnerability)\s+discovered)')
    match = re.findall(pattern, data)
    if len(match) != 0:
        print('Vulnerabilities Found!\n')
        for x in match:
            print('--> {0}'.format(x[0]))
    else:
        print('No Vulnerability Found!\n')
I saw a question here:
Regex to capture {}
which is similar to what I want, but I cannot get it to work.
My data is:
[Honda] Japanese manufacturer [VTEC] Name of electronic lift control
And I want the output to be
[Honda], [VTEC]
My expression is:
m = re.match('(\[[^\[\]]*\])', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
I would expect:
m.group(0) to output [Honda]
m.group(1) to output [VTEC]
However both output [Honda]. How can I access the second match?
You only have one group in your expression, so you can only ever get that one group. Group 1 is the capturing group, group 0 is the whole matched text; in your expression they are one and the same. Had you omitted the (...) parentheses, you'd only have a group 0.
If you wanted to get all matches, use re.findall(). This returns a list of matching groups (or group 0, if there are no capturing groups in your expression):
>>> import re
>>> re.findall(r'\[[^\[\]]*\]', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
['[Honda]', '[VTEC]']
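The group numbering described above can be checked directly. In this sketch the variable names (bracketed, with_group, without_group, all_matches) are just illustrative, and re.finditer() is shown as an alternative for when the match objects themselves are needed.

```python
import re

text = '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control'
bracketed = r'\[[^\[\]]*\]'

# re.match only looks at the start of the string and returns ONE match object.
with_group = re.match('(' + bracketed + ')', text)
print(with_group.group(0))  # [Honda]  (whole match)
print(with_group.group(1))  # [Honda]  (the only capturing group)

# Without the parentheses there is no group 1 at all, only group 0.
without_group = re.match(bracketed, text)
print(without_group.groups())  # ()

# finditer yields one match object per occurrence.
all_matches = [m.group(0) for m in re.finditer(bracketed, text)]
print(all_matches)  # ['[Honda]', '[VTEC]']
```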
You can use re.findall to get all the matches, though you'll get them in a list, and you don't need capture groups:
m = re.findall(r'\[[^\[\]]*\]', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
Gives ['[Honda]', '[VTEC]'] so you can get each with:
print(m[0])
# => [Honda]
print(m[1])
# => [VTEC]
If you are considering an approach other than re:
s="[Honda] Japanese manufacturer [VTEC] Name of electronic lift control"
result = []
tempStr = ""
flag = False
for i in s:
    if i == '[':
        flag = True
    elif i == ']':
        flag = False
    elif flag:
        tempStr = tempStr + i
    elif tempStr != "":
        result.append(tempStr)
        tempStr = ""
print(result)
Output:
['Honda', 'VTEC']
I need to implement a Python regular expression to search for all occurrences of A1a or A_1_a or A-1-a or _A_1_a_ or _A1a, where:
A can be A to Z.
1 can be 1 to 9.
a can be a to z.
There are only three characters (letter, number, letter), separated by underscores, dashes or nothing. The case in the search string needs to be matched exactly.
The main problem I am having is that sometimes these three-character combinations are connected to other text by dashes and underscores. I also need a single regular expression that finds A1a, A-1-a and A_1_a.
Also I forgot to mention this is an XML file.
Thanks this found every occurrence of what I was looking for with a slight modification [-]?[A][-]?[1][-]?[a][-]?, but I need to have these be variables something like
[-]?[var_A][-]?[var_3][-]?[Var_a][-]?
would that be done like this
regex = r"[-]?[%s][-]?[%s][-]?[%s][-]?"
print re.findall(regex,var_A,var_Num,Var_a)
Or more like:
regex = ''.join(['r','\"','[-]?[',Var_X,'][-]?[',Var_Num,'][-]?[',Var_x,'][-]?','\"'])
print regex
for sstr in searchstrs:
matches = re.findall(regex, sstr, re.I)
But this isn't working
Sample Lines of the File:
Before Running Script
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="A_3_a Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="A3a1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="A_3_a_2 Energized from Norm" t:S="0" t:SC="5">
After Running Script
What I am getting: (It's deleting the entire line and leaving only what is below)
B_1_c
B1c1
B_1_c_2
What I Want to get:
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="B_1_c Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="B1c1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="B_1_c_2 Energized from Norm" t:S="0" t:SC="5">
import re
import os
search_file_name = 'Alarms Test.fwn'
pattern = 'A3a'
fileName, fileExtension = os.path.splitext(search_file_name)
newfilename = fileName + '_' + pattern + fileExtension
outfile = open(newfilename, 'wb')
def find_ext(text):
    matches = re.findall(r'([_-]?[A{1}][_-]?[3{1}][_-]?[a{1}][_-]?)', text)
    records = [m.replace('3', '1').replace('A', 'B').replace('a', 'c') for m in matches]
    if matches:
        outfile.writelines(records)
        return 1
    else:
        outfile.writelines(text)
        return 0
def main():
    success = 0
    count = 0
    with open(search_file_name, 'rb') as searchfile:
        try:
            searchstrs = searchfile.readlines()
            for s in searchstrs:
                success = find_ext(s)
                count = count + success
        finally:
            searchfile.close()
    print count
if __name__ == "__main__":
    main()
You want to use the following to find your matches.
matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', s, re.I)
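Since the question also asks for the three characters to be variables and for the rest of each line to be kept, here is a rough sketch of one way to do that with re.sub(). The helper names build_pattern and swap_code are invented for this example. It assumes the separators should be left exactly as they were found, and it stays case-sensitive because the question says the case must match exactly. The sample lines are shortened versions of the ones in the question.

```python
import re

def build_pattern(letter, digit, lower):
    # Hypothetical helper: an optional _ or - before each character, captured so
    # the replacement can put the same separators back. re.escape keeps the
    # variable parts literal.
    sep = r'([_-]?)'
    return re.compile(
        sep + re.escape(letter) + sep + re.escape(digit) + sep + re.escape(lower)
    )

def swap_code(line, old, new):
    rx = build_pattern(*old)

    def repl(m):
        # Rebuild the match with the new characters and the original separators.
        return m.group(1) + new[0] + m.group(2) + new[1] + m.group(3) + new[2]

    # re.sub returns the whole line with only the matched parts replaced.
    return rx.sub(repl, line)

lines = [
    't:L="A_3_a Fdr2" t:VS="true">',
    't:L="A3a1 Status" t:VS="1">',
    't:L="A_3_a_2 Energized from Norm" t:S="0">',
]
converted = [swap_code(line, ('A', '3', 'a'), ('B', '1', 'c')) for line in lines]
for line in converted:
    print(line)
```

Writing the string returned by re.sub() to the output file, instead of the list of matches, is what keeps the rest of each line intact. Note that this pattern will also match inside a longer token such as XA3ab, so it may need extra context around it for a real file.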
If you are looking to find the matches and then strip all of the - and _ characters, you could do:
import re
s = '''
A1a _A_1 A_ A_1_a A-1-a _A_1_a_ _A1a _A-1-A_ a1_a A-_-5-a
_A-_-5-A a1_-1 XMDC_A1a or XMDC-A1a or XMDC_A1-a XMDC_A_1_a_ _A-1-A_
'''
def find_this(text):
    matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', text, re.I)
    records = [m.replace('-', '').replace('_', '') for m in matches]
    print(records)
find_this(s)
Output
['A1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A', 'a1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A']
To quickly get the A1as out without the punctuation, and not having to reconstruct the string from captured parts...
t = '''A1a _B_2_z_
A_1_a
A-1-a
_A_1_a_
_C1c '''
re.findall("[A-Z][0-9][a-z]",t.replace("-","").replace("_",""))
Output:
['A1a', 'B2z', 'A1a', 'A1a', 'A1a', 'C1c']
(But if you don't want to capture from FILE.TXT-2b, then you would have to be careful about most of these solutions...)
If the string can be separated by multiple underscores or dashes (e.g. A__1a):
[_-]*[A-Z][_-]*[1-9][_-]*[a-z]
If there can only be one or zero underscores or dashes:
[_-]?[A-Z][_-]?[1-9][_-]?[a-z]
regex = r"[A-Z][-_]?[1-9][-_]?[a-z]"
print(re.findall(regex, some_string_variable))
should work
to just capture the parts you're interested in, wrap them in parens
regex = r"([A-Z])[-_]?([1-9])[-_]?([a-z])"
print(re.findall(regex, some_string_variable))
if the underscores or dashes (or lack thereof) must be the same in both positions, the patterns above will return bad results; in that case capture the first separator and repeat it with a backreference, e.g. ([A-Z])([-_]?)([1-9])\2([a-z])
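For the case where the separator has to be the same in both positions, Python's re module supports backreferences, which is one way to handle it. This is a sketch under that reading of the requirement; the names CONSISTENT and find_consistent are made up here.

```python
import re

# Group 2 captures the first separator ('_', '-' or nothing);
# \2 then requires exactly the same text in the second position.
CONSISTENT = re.compile(r"([A-Z])([-_]?)([1-9])\2([a-z])")

def find_consistent(text):
    # Return the three characters joined together, without the separators.
    return [m.group(1) + m.group(3) + m.group(4) for m in CONSISTENT.finditer(text)]

codes = find_consistent("A1a B_2_b C-3-c D_4-d E5_e")
print(codes)  # ['A1a', 'B2b', 'C3c']
```

D_4-d and E5_e are skipped because their two separators differ.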