Python - Regex did not match string - python

I am trying to get last regex match on a message broadcast by socket, but it returns blank.
>>> msg = ':morgan.freenode.net 353 MechaBot = #xshellz :MechaBot ITechGeek zubuntu whitesn JarodRo SpeedFuse st3v0 anyx danielhyuuga1 AussieKid92 JeDa Failed Guest83885 RiXtEr xryz D-Boy warsoul buggiz rawwBNC MagixZ fedai Sunborn oatgarum dune SamUt Pythonista_ +xinfo madmattco BuGy azuan DarianC stupidpioneers AnTi_MTtr JeDaYoshi|Away PaoLo- StephenS chriscollins Rashk0 morbid1 Lord255 victorix [DS]Matej EvilSoul `|` united Scrawn avira ssnova munsterman Logxen niko gorut Jactive|OFF grauwulf b0lt saapete'
>>> r = re.compile(r"(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = (?P<msg>.*?)", re.IGNORECASE)
>>> r.search(msg).groups()
(':morgan.freenode.net', '353', 'MechaBot', '')

(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = (?P<msg>.*)
Try this.This works.See demo.Your code use .*? whch says match as few characters as you can.So while it your previous you have used .*? <space> it matches upto first space it encounters,in the last you have no specified anythng.So it did not match anythin as it s in lazy mode.
https://regex101.com/r/aQ3zJ3/1
You can also use
(?P<host>.*?) (?P<code>.*?) (?P<name>.*?) = (?P<msg>.*?)$
which says match lazily upto end.

Related

REGEX_String between strings in a list

From this list:
['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
I would like to reduce it to this list:
['BELMONT PARK', 'EAGLE FARM']
You can see from the first list that the desired words are between '\n' and '('.
My attempted solution is:
for i in x:
result = re.search('\n(.*)(', i)
print(result.group(1))
This returns the error 'unterminated subpattern'.
Thankyou
You’re getting an error because the ( is unescaped. Regardless, it will not work, as you’ll get the following matches:
\nBELMONT PARK (
\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (
You can try the following:
(?<=\\n)(?!.*\\n)(.*)(?= \()
(?<=\\n): Positive lookbehind to ensure \n is before match
(?!.*\\n): Negative lookahead to ensure no further \n is included
(.*): Your match
(?= \(): Positive lookahead to ensure ( is after match
You can get the matches without using any lookarounds, as you are already using a capture group.
\n(.*) \(
Explanation
\n Match a newline
(.*) Capture group 1, match any character except a newline, as much as possible
\( Match a space and (
See a regex101 demo and a Python demo.
Example
import re
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
for i in x:
m = re.search(pattern, i)
if m:
print(m.group(1))
Output
BELMONT PARK
EAGLE FARM
If you want to return a list:
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
res = [m.group(1) for i in x for m in [re.search(pattern, i)] if m]
print(res)
Output
['BELMONT PARK', 'EAGLE FARM']

How to extract all comma delimited numbers inside () bracked and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if that are alone in a line. But i cant seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I current use in python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
Remove the ^, $ is whats preventing you from getting all the numbers. And gm flags wont work in python re.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern tells to match numbers which begin with 1-9 followed by one or more digits, only if the numbers begin with or end with either comma or brackets.
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one number (would \d+ work too?)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

Regex to catch only the certain part of the string

Is there universal regex to catch only the names of companies?
Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp
The output should be:
American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc
Note:
The name of the company may contain symbols or numbers
You can use this regex,
[a-zA-Z]+(?:_[a-zA-Z]+)*$
Your company names all start with alphabetical words and hyphen separated till end of string, for which above regex will work fine.
Here, [a-zA-Z]+ starts matching alphabetical company names, and (?:_[a-zA-Z]+)* further matches any alphabetical words having hyphen before them and $ ensures the matched string ends with the string.
Regex Demo
Python code,
import re
arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']
for s in arr:
m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
print(s, '-->', m.group())
Prints,
Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc
Also, if you have a single string of those company names, then you can use following code and use re.findall to list all company names,
import re
s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''
print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))
Prints,
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
Edit:
As Chyngyz Akmatov raised, if name can contain numbers and in general any symbol, then this regex will get the name properly, which assumes company name starts after year part and underscore.
(?<=\d{4}_).*$
Demo handling any character in company name
You can use re.sub:
import re
data = [re.sub('\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]
Output:
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
You can also use this regex:
_\d+(?:_\d+)*_(.*)
Code:
import re
lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']
for x in lst:
print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))
# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc
Assuming there are only normal letters and the names are the end of each line :
grep -o '[A-Za-z][A-Za-z_]*$' names

Match words only if preceded by specific pattern

I have a string from a NWS bulletin:
LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley
My aim is to extract a couple fields with regular expressions. In the first string I want "AAD" and from the second string I want "RECHNX". I have tried:
( )\w{3} #for the first string
and
\w{6} #for the 2nd string
But these find all 3 and 6 character strings leading up to the string I want.
Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:
(?<=\d{6}\s)[A-Z]+
Demo: https://regex101.com/r/dsDHTs/1
Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:
(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*
Demo: https://regex101.com/r/dsDHTs/5
If you have a specific list of valid fields, you could also simply use:
(AAD|TMLB|RECHNX|RR4HNX)
https://regex101.com/r/dsDHTs/3
Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):
re.search(r'\b\d+ (\w+)', s).group(1)
To read first groups of word chars from each line, you can use a pattern like
(\w+) (\w+) (\w+) (\w+).
Then, from the first line read group No 4 and from the second line read group No 3.
Look at the following program. It prints four groups from each source line:
import re
txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""
n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
n += 1
print(f'{n:2}: {line}')
mtch = pat.search(line)
if mtch:
gr = [ mtch.group(i) for i in range(1, 5) ]
print(f' {gr}')
The result is:
1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
['LTUS41', 'KCAR', '141558', 'AAD']
2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
['KHNX', '141001', 'RECHNX', 'Weather']

Trouble scooping out a certain portion of text out of a chunk

How can I get the address appeared before Telephone from the portion of text I've pasted below. I tried with the following but it gives me nothing.
This is the code I've tried so far with:
import re
content="""
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
Telephone: 819-569-9255
Website: http://www.assurancescb.com
"""
pattern = re.compile(r"(.*)(?=Telephone)")
for item in pattern.finditer(content):
print(item.group())
Expected output:
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
The block of texts are always like the pasted one and there is no flag attached to it using which I opt for positive lookbehind so I tried like above instead.
The dot does not match a line break character so you could use a modifier (?s) or use re.S or re.DOTALL
pattern = re.compile(r"(.*)(?=Telephone)", re.S)
or
pattern = re.compile(r"(?s)(.*)(?=Telephone)")
You could also get the match without using a group:
(?s).*(?=Telephone)
Change the line
pattern = re.compile(r"(.*)(?=Telephone)")
To
pattern = re.compile(r"(.*)(?=Telephone)", re.DOTALL)
So that your regex wildcard (*) would match newline characters.
:)

Categories

Resources