Capture the text inside square brackets using a regex - python

I saw this question:
Regex to capture {}
which is similar to what I want, but I cannot get it to work.
My data is:
[Honda] Japanese manufacturer [VTEC] Name of electronic lift control
And I want the output to be
[Honda], [VTEC]
My expression is:
m = re.match(r'(\[[^\[\]]*\])', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
I would expect:
m.group(0) to output [Honda]
m.group(1) to output [VTEC]
However both output [Honda]. How can I access the second match?

You only have one group in your expression, so you can only ever get that one group. Group 1 is the capturing group, group 0 is the whole matched text; in your expression they are one and the same. Had you omitted the (...) parentheses, you'd only have a group 0.
If you want all matches, use re.findall(). It returns a list of the captured groups (or of the whole matches, i.e. group 0, if the expression has no capturing groups):
>>> import re
>>> re.findall(r'\[[^\[\]]*\]', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
['[Honda]', '[VTEC]']
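To make the difference concrete, here is a minimal sketch that contrasts re.match with re.finditer on the same data. The variable names are only illustrative.

```python
import re

text = '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control'
pattern = r'(\[[^\[\]]*\])'

# re.match only tries at the start of the string and yields a single match.
m = re.match(pattern, text)
first_whole = m.group(0)  # the whole matched text
first_group = m.group(1)  # the one capturing group, the same text here

# re.finditer yields one match object per occurrence, left to right.
all_matches = [mo.group(0) for mo in re.finditer(pattern, text)]
print(first_whole, first_group, all_matches)
```

Unlike re.findall, re.finditer gives you full match objects, which is handy if you also need positions via mo.start() and mo.end().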

You can use re.findall to get all the matches, though you'll get them in a list, and you don't need capture groups:
m = re.findall(r'\[[^\[\]]*\]', '[Honda] Japanese manufacturer [VTEC] Name of electronic lift control')
Gives ['[Honda]', '[VTEC]'] so you can get each with:
print(m[0])
# => [Honda]
print(m[1])
# => [VTEC]

If you are open to approaches other than re:
s = "[Honda] Japanese manufacturer [VTEC] Name of electronic lift control"
result = []
tempStr = ""
flag = False  # True while we are between '[' and ']'
for i in s:
    if i == '[':
        flag = True
    elif i == ']':
        flag = False
    elif flag:
        tempStr = tempStr + i
    elif tempStr != "":
        # first character after a closing bracket: store the collected text
        result.append(tempStr)
        tempStr = ""
print(result)
Output:
['Honda', 'VTEC']
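For comparison, the same bracket-free output can probably be had from re as well by putting a capturing group around the inner text only. This is a small sketch, not part of the answers above:

```python
import re

s = "[Honda] Japanese manufacturer [VTEC] Name of electronic lift control"
# The parentheses capture only the text between the brackets,
# so findall returns the names without '[' and ']'.
inner = re.findall(r'\[([^\[\]]*)\]', s)
print(inner)
```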

Related

Given string, return a dictionary of all the phone numbers in that text

I just started learning dictionaries and regex and I'm having trouble creating a dictionary. In my task, area code is a combination of plus sign and three numbers. The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily.
def find_phone_numbers(text: str) -> dict:
    pattern = r'\+\w{3} \w{8}|\+\w{11}|\+\w{3} \w{7}|\+\w{10}|\w{8}|\w{7}'
    match = re.findall(pattern, text)
    str1 = " "
    phone_str = str1.join(match)
    phone_dict = {}
    phones = phone_str.split(" ")
    for phone in phones:
        if phone[0] == "+":
            phone0 = phone
            if phone_str[0:4] not in phone_dict.keys():
                phone_dict[phone_str[0:4]] = [phone_str[5:]]
    return phone_dict
The result should be:
print(find_phone_numbers("+372 56887364 +37256887364 +33359835647 56887364 +11 1234567 +327 1 11111111")) ->
{'+372': ['56887364', '56887364'], '+333': ['59835647'], '': ['56887364', '1234567', '11111111']}
The main problem is that phone numbers with the same area code can be written together or separately. I had an idea to use a for loop to get rid of the "tail" in the form of a phone number and only the area code will remain, but I don't understand how to get rid of the tail here +33359835647. How can this be done and is there a more efficient way?
Try this (the regex pattern is explained on Regex101):
import re
s = "+372 56887364 +37256887364 +33359835647 56887364 +11 1234567 +327 1 11111111"
pat = re.compile(r"(\+\d{3})?\s*(\d{7,8})")
out = {}
for pref, number in pat.findall(s):
    out.setdefault(pref, []).append(number)
print(out)
Prints:
{
"+372": ["56887364", "56887364"],
"+333": ["59835647"],
"": ["56887364", "1234567", "11111111"],
}
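The same idea can be wrapped in the function signature from the question. The sketch below assumes the pattern from this answer; PHONE_RE is an illustrative name, and using defaultdict instead of setdefault is just a stylistic choice.

```python
import re
from collections import defaultdict

# Optional "+" and three digits, optional whitespace, then 7-8 digits.
PHONE_RE = re.compile(r"(\+\d{3})?\s*(\d{7,8})")

def find_phone_numbers(text: str) -> dict:
    """Group phone numbers by area code; numbers without one go under ''."""
    phones = defaultdict(list)
    # findall returns '' for the area-code group when it did not take part in the match
    for prefix, number in PHONE_RE.findall(text):
        phones[prefix].append(number)
    return dict(phones)

result = find_phone_numbers(
    "+372 56887364 +37256887364 +33359835647 56887364 +11 1234567 +327 1 11111111"
)
print(result)
```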

Split string and replace Discord emoji to [name]

I have incoming messages such as <a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes: and I want this output to be [GG] [1Copy][14][eyes]Hello friend![eyes]
The code below is what I currently have, and it kind of works. The example above outputs [GG] [1Copy] [14] [eyes]
def shorten_emojis(content):
    seperators = ("<a:", "<:")
    output = []
    for chunk in content.split():
        if any(match in chunk for match in seperators):
            parsed_chunk = []
            new_chunk = chunk.replace("<", ";<").replace(">", ">;")
            for emo in new_chunk.split(";"):
                if emo.startswith(seperators):
                    emo = f"<{splits[1]}>" if len(splits := emo.split(":")) == 3 else emo
                parsed_chunk.append(emo)
            chunk = "".join(parsed_chunk)
        output.append(chunk)
    output = " ".join(output)
    for e in re.findall(":.+?:", content):
        output = output.replace(e, f"<{e.replace(':', '')}>")
    return output
Test #1
Input: <a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:
Output: [GG] [1Copy] [14] :eyes:Hello friend!:eyes:
Desired [GG] [1Copy][14][eyes]Hello friend![eyes]
Test #2
Input: <a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:
Output: [cryLaptop] [1Copy] [14] [thoonk] [coolbutdepressed] [KL1Heart] Nice [dogwonder] OK:eyes:
Desired [cryLaptop] [GG] [1Copy] [14] [thoonk] [coolbutdepressed] [KL1Heart] Nice [dogwonder] OK[eyes]
Edit
I have edited my code block, it now works as desired.
You might use a single pattern with an alternation | to match both variations. Then in the callback of sub, you can check for the existence of group 1.
<a?:([^:<>]+)[^<>]*>|:([^:]+):
The pattern matches
<a?: Match <, optional a and :
([^:<>]+) Capture in group 1 any char except : < and >
[^<>]*> Optionally match any char except < and >, then match >
| Or
:([^:]+): Capture in group 2 all between :
See a regex demo and a Python demo.
For example
import re
pattern = r"<a?:([^:<>]+)[^<>]*>|:([^:]+):"
def shorten_emojis(content):
    return re.sub(
        pattern,
        lambda x: f"[{x.group(1)}]" if x.group(1) else f"[{x.group(2)}]",
        content,
    )
print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))
print(shorten_emojis("<a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:"))
Output
[GG] [1Copy][14][eyes]Hello friend![eyes]
[cryLaptop][1Copy][14] [thoonk][coolbutdepressed][KL1Heart]Nice[dogwonder] OK[eyes]
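As a possible simplification of the callback: only one of the two groups takes part in any given match, so `or` can pick whichever one captured the name. This is a sketch using the same pattern; EMOJI_RE is an illustrative name.

```python
import re

EMOJI_RE = re.compile(r"<a?:([^:<>]+)[^<>]*>|:([^:]+):")

def shorten_emojis(content: str) -> str:
    # The group that did not participate is None, and both groups need at
    # least one character, so `or` always yields the captured emoji name.
    return EMOJI_RE.sub(lambda m: f"[{m.group(1) or m.group(2)}]", content)

short = shorten_emojis(
    "<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"
)
print(short)
```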
You can do this with regular expressions. The re module is already part of Python's standard library.
I have made the code a bit more compact, but I think it is just as easy to follow.
The most important thing is to detect the three kinds of text. With (<.*?>) we select the <words>, with (:.*?:) the :words:, and with (.*?) the rest of the text.
Then we format each piece as expected and build the output.
import re
def shorten_emojis(content):
    tags = re.findall('((<.*?>)|(:.*?:)||(.*?))', content)
    output = ""
    for tag in tags:
        if re.findall("<.*?>", tag[0]):
            valor = re.search(':.*?:', tag[0])
            output += f"[{valor.group()[1:-1]}]"
        elif re.match(":.*?:", tag[0]):
            output += f"[{tag[0][1:-1]}]"
        else:
            output += f"{tag[0]}"
    return output
print(shorten_emojis("<a:GG:123456789> <:1Copy:12345678><:14:1256678>:eyes:Hello friend!:eyes:"))
print(shorten_emojis("<a:cryLaptop:738450655395446814><:1Copy:817543814481707030><:14:817543815401439232> <:thoonk:621279654711656448><:coolbutdepressed:621279653675532290><:KL1Heart:585547199480332318>Nice<:dogwonder:621251869058269185> OK:eyes:"))
RESULT:
[GG] [1Copy][14][eyes]Hello friend![eyes]
[cryLaptop][1Copy][14] [thoonk][coolbutdepressed][KL1Heart]Nice[dogwonder] OK[eyes]

Function to extract company register number from text string using Regex

I have a function which extracts the company register number (German: handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see demo), I can not extract the correct company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re
def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string)  # list of matched words
    if handelsregisternummer:  # not empty
        return handelsregisternummer[0]
    else:  # no match found
        handelsregisternummer = ""
    return handelsregisternummer
Example text scraped from a website. Missing line breaks make words run together:
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer:  # if list is not empty anymore, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
if not handelsregisternummer:  # if list is empty
    handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer':handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}
You need to use two capturing groups in the regex to capture the keyword and the number (including the optional B), and just match the rest:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            |_________|                                   |___________________|
With two groups, findall returns a tuple of the captured groups for each match, so you then join the parts of that tuple:
if handelsregisternummer: # if list is not empty anymore, then do...
    handelsregisternummer = " ".join(handelsregisternummer)
    break
See the Python demo.
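Putting both changes together, a self-contained version could look like the sketch below. It departs from the original in ways that are my assumptions: the keyword loop is moved inside the function, the keywords are passed as a tuple, and re.escape is applied as a precaution.

```python
import re

def get_handelsregisternummer(string, keywords=('HRB', 'HRA', 'HR B', 'HR A')):
    """Return e.g. 'HRB 142663 B', or 'not specified' if nothing matches."""
    for keyword in keywords:
        # re.escape is a precaution in case a keyword ever contains regex metacharacters
        reg_1 = fr'\b({re.escape(keyword)})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
        found = re.findall(reg_1, string)  # list of (keyword, number) tuples
        if found:
            return " ".join(found[0])
    return 'not specified'

text_impressum = '"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:'
print(get_handelsregisternummer(text_impressum))
```

Note that the optional " B" is taken even though it is really the first letter of "BVAT-ID" in this scraped text. That is what the question asks for, but it may be worth checking against other inputs.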

How to get a value for a key in a string, when followed by another specific key=value set

My code looks like this:
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
pattern = r'title=(.*?) color=red'
print(re.compile(pattern).search(string).group(0))
and I got
"title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
But I want to find all the contents of "title"s immediately followed by "color=red"
You want what immediately precedes color=red? Then use
.*title=(.*?) color=red
Demo: https://regex101.com/r/sR4kN2/1
This greedily matches everything that comes before color=red, so that only the desired title appears.
Alternatively, if you know there is a character that doesn't appear in the title, you can simplify by just using a character class exclusion. For example, if you know = won't appear:
title=([^=]*?) color=red
Or, if you know whitespace won't appear:
title=([^\s]*?) color=red
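As a quick check of the whitespace-exclusion variant against the string from the question, here is a minimal sketch. It uses re.findall to collect every title that is directly followed by color=red:

```python
import re

string = ("title=abcd color=green title=efgh color=blue title=xyxyx color=yellow "
          "title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red")

# [^\s]* cannot cross a space, so each match stays within a single title value.
red_titles = re.findall(r'title=([^\s]*) color=red', string)
print(red_titles)
```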
A third option, using a bit of code to find all red titles (assuming that the input always alternates title, color):
for title, color in re.findall(r'title=(.*?) color=(.*?)(?: |$)', string):
    if color == 'red':
        print(title)
If you want to get the last match of a sub-pattern before a certain pattern, the solution is to use a greedy skipper. For example:
>>> pattern = '.*title="([^"]*)".*color="#123"'
>>> text = 'title="123" color="#456" title="789" color="#123"'
>>> print(re.match(pattern, text).group(1))
789
The first .* is greedy: it skips as much as possible (thus skipping the first title) and then backs up to the last title that still allows the desired color to match.
As a simpler example consider that
a(.*)b(.*)c
processed on
a1111b2222b3333c
will match 1111b2222 in the first group and 3333 in the second.
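This is easy to confirm in Python (a minimal check of the example above):

```python
import re

m = re.match(r'a(.*)b(.*)c', 'a1111b2222b3333c')
# The first .* grabs as much as it can, then backs off only far enough
# for the remaining "b(.*)c" to match, so it stops at the last "b".
print(m.groups())
```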
Why don't you skip the regexes, and use some split functionality instead:
search_title = False
found = None
string = "title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red title=xxxy red=anything title=xxxyyy color=red"
parts = string.split()
for part in parts:
    key, value = part.split('=', 1)
    if search_title:
        if key == 'title':
            found = value
            search_title = False
    if key == 'color' and value == 'red':
        search_title = True
print(found)
results in
xxxy
Regexes are nice, but can cause headaches at times.
Try this using the re module:
>>> string = 'title=abcd color=green title=efgh color=blue title=xyxyx color=yellow title=whatIwaht color=red'
>>> import re
>>> re.search('(.*title=?)(.*) color=red', string).group(2)
'whatIwaht'
>>> re.search('(.*title=?)(.*) color=yellow', string).group(2)
'xyxyx'

Python Regular Expression to find all combinations of a Letter Number Letter Designation

I need to implement a Python regular expression to search for all occurrences of A1a or A_1_a or A-1-a or _A_1_a_ or _A1a, where:
A can be A to Z.
1 can be 1 to 9.
a can be a to z.
There are only three characters (letter, number, letter), separated by underscores, dashes or nothing. The case in the search string needs to be matched exactly.
The main problem I am having is that these three-character combinations are sometimes connected to other text by dashes and underscores. I also want a single regular expression that finds A1a, A-1-a and A_1_a.
I also forgot to mention that this is an XML file.
Thanks, this found every occurrence of what I was looking for with a slight modification, [-]?[A][-]?[1][-]?[a][-]?, but I need these to be variables, something like
[-]?[var_A][-]?[var_3][-]?[Var_a][-]?
Would that be done like this:
regex = r"[-]?[%s][-]?[%s][-]?[%s][-]?"
print(re.findall(regex,var_A,var_Num,Var_a))
Or more like:
regex = ''.join(['r','\"','[-]?[',Var_X,'][-]?[',Var_Num,'][-]?[',Var_x,'][-]?','\"'])
print(regex)
for sstr in searchstrs:
    matches = re.findall(regex, sstr, re.I)
But this isn't working.
Sample Lines of the File:
Before Running Script
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="A_3_a Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="A3a1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="A_3_a_2 Energized from Norm" t:S="0" t:SC="5">
After Running Script
What I am getting: (It's deleting the entire line and leaving only what is below)
B_1_c
B1c1
B_1_c_2
What I Want to get:
<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="B_1_c Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="B1c1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="B_1_c_2 Energized from Norm" t:S="0" t:SC="5">
import re
import os
search_file_name = 'Alarms Test.fwn'
pattern = 'A3a'
fileName, fileExtension = os.path.splitext(search_file_name)
newfilename = fileName + '_' + pattern + fileExtension
outfile = open(newfilename, 'wb')
def find_ext(text):
    matches = re.findall(r'([_-]?[A{1}][_-]?[3{1}][_-]?[a{1}][_-]?)', text)
    records = [m.replace('3', '1').replace('A', 'B').replace('a', 'c') for m in matches]
    if matches:
        outfile.writelines(records)
        return 1
    else:
        outfile.writelines(text)
        return 0
def main():
    success = 0
    count = 0
    with open(search_file_name, 'rb') as searchfile:
        try:
            searchstrs = searchfile.readlines()
            for s in searchstrs:
                success = find_ext(s)
                count = count + success
        finally:
            searchfile.close()
    print(count)
if __name__ == "__main__":
    main()
You want to use the following to find your matches.
matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', s, re.I)
See regex101 demo
If you are looking to find the matches and then strip all of the - and _ characters, you could do:
import re
s = '''
A1a _A_1 A_ A_1_a A-1-a _A_1_a_ _A1a _A-1-A_ a1_a A-_-5-a
_A-_-5-A a1_-1 XMDC_A1a or XMDC-A1a or XMDC_A1-a XMDC_A_1_a_ _A-1-A_
'''
def find_this(text):
    matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', text, re.I)
    records = [m.replace('-', '').replace('_', '') for m in matches]
    print(records)
find_this(s)
Output
['A1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A', 'a1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A']
See working demo
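For the follow-up in the question (building the pattern from variables and replacing in place without losing the rest of the line), one possible approach is re.sub with the separators captured and put back. This is only a sketch: make_replacer is a hypothetical helper, and it assumes the designation is exactly three characters and that case must match exactly.

```python
import re

def make_replacer(old, new):
    """Hypothetical helper: old/new are three-character designations such as 'A3a' and 'B1c'."""
    sep = r'([_-]?)'
    # Build the pattern from the variable parts, e.g. A([_-]?)3([_-]?)a;
    # re.escape guards against regex metacharacters in the input.
    regex = re.compile(sep.join(re.escape(ch) for ch in old))
    # \g<1> and \g<2> put the original separators back; the \g<n> form is
    # needed because a plain \1 followed by a digit would be misread.
    template = new[0] + r'\g<1>' + new[1] + r'\g<2>' + new[2]
    return lambda line: regex.sub(template, line)

replace_a3a = make_replacer('A3a', 'B1c')
for line in ['t:L="A_3_a Fdr2"', 't:L="A3a1 Status"', 't:L="A_3_a_2 Energized from Norm"']:
    print(replace_a3a(line))
```

Because re.sub returns the whole line with only the matched part changed, writing its result to the output file keeps the surrounding XML intact.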
To quickly get the A1as out without the punctuation, and not having to reconstruct the string from captured parts...
t = '''A1a _B_2_z_
A_1_a
A-1-a
_A_1_a_
_C1c '''
re.findall("[A-Z][0-9][a-z]",t.replace("-","").replace("_",""))
Output:
['A1a', 'B2z', 'A1a', 'A1a', 'A1a', 'C1c']
(But if you don't want to capture from FILE.TXT-2b, then you would have to be careful about most of these solutions...)
If the string can be separated by multiple underscores or dashes (e.g. A__1a):
[_-]*[A-Z][_-]*[1-9][_-]*[a-z]
If there can only be one or zero underscores or dashes:
[_-]?[A-Z][_-]?[1-9][_-]?[a-z]
regex = r"[A-Z][-_]?[1-9][-_]?[a-z]"
print(re.findall(regex, some_string_variable))
should work.
To capture just the parts you're interested in, wrap them in parentheses:
regex = r"([A-Z])[-_]?([1-9])[-_]?([a-z])"
print(re.findall(regex, some_string_variable))
Note that these patterns also accept mixed separators such as A_1-a. If the underscores or dashes (or lack thereof) must be the same in both positions, the pattern has to remember which separator it saw first, which these patterns do not do.
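One way to require matching separators without leaving re is a backreference, which makes the pattern repeat whatever the earlier group captured. A minimal sketch, where the sample strings are made up for illustration:

```python
import re

# ([-_]?) captures the first separator (possibly empty); \2 demands the same text again.
consistent = re.compile(r"([A-Z])([-_]?)([1-9])\2([a-z])")

samples = ["A1a", "A_1_a", "A-1-a", "A_1-a", "A1_a"]
accepted = [s for s in samples if consistent.fullmatch(s)]
print(accepted)
```

With re.findall this pattern returns four-element tuples (letter, separator, digit, letter), so the separator has to be skipped when reading the results.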
