Regular expression for extracting multiple patterns - Python

I have a string like this:
string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)
Output
('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
('General Info.', '[PA Number:', ' R180126187]')
('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')
Wanted output:
('Claim Status', 'Primary Status:Paidup to Rebilled')
('General Info.', 'PA Number:R180126187')
('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')

You may achieve what you need with a solution like this:
import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)
See the Python online demo. Output:
[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]
With the ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex, you match two consecutive lines with the second being optional (due to the (?:...)? optional non-capturing group), the first is captured into Group 1 and the subsequent one (that starts with [ and ends with ]) is captured into Group 2. (Note that \r?$ is necessary since in the multiline mode $ only matches before a newline and not a carriage return.) Group 1 value is added to a temporary list, then the contents of the second group is split with ], [ (if you are not sure about the amount of whitespace, you may use re.split(r']\s*,\s*\[', m.group(2))) and then only add those items that contain a : in them to the temporary list.
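If the spacing between the bracketed items is not always exactly "], [", the re.split variant mentioned above could look roughly like this. This is only a sketch: the function name parse_sections and the shortened sample string (with made-up ICN and Id values) are illustrative and not part of the original answer.

```python
import re

def parse_sections(s):
    """Return one tuple per heading line: (heading, 'key: value', ...)."""
    res = []
    for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
        t = [m.group(1).strip()]
        if m.group(2):
            # Split on "]" , optional whitespace, ",", optional whitespace, "["
            parts = re.split(r']\s*,\s*\[', m.group(2))
            # Keep only the items that look like "key: value"
            t.extend(p.strip() for p in parts if ':' in p)
        res.append(tuple(t))
    return res

# Hypothetical sample with irregular spacing between the bracketed items
sample = "Claim Insurance: Modified\r\n[Ins. Mode: Primary],[Corrected Claim Checked] ,  [ICN: 123], [Id: 456]"
print(parse_sections(sample))
```

The only change from the loop above is the splitting step, so items separated by "],[" or "] ,  [" are handled the same way as "], [".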

You are getting three elements per result because you are using "capturing" regular expressions. Rewrite your regexp like this to combine the second and third match:
re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)
A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.
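To see the difference side by side, here is a small comparison of the two patterns. The sample string is a shortened version of the one in the question, with a made-up PA number.

```python
import re

# Shortened sample; the PA number is a placeholder
s = "Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: 123]"

# Three capturing groups: findall returns 3-tuples
three = re.findall(r'(.*)\r\n(.*?:)(.*?])', s)

# Inner groups made non-capturing, one capturing group around both: 2-tuples
two = re.findall(r'(.*)\r\n((?:.*?:)(?:.*?]))', s)

print(three)
print(two)
```

Note that the second element still contains the square brackets and the space after the colon, so some cleanup (for example with str.strip('[]')) would still be needed to reach the wanted output.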

Related

Using regex to parse string patterns from a single string

I have read in product pricing for some products. As you will see below, not every product pricing string is set up the same. What I am trying to do is to parse out the sub-strings I do not want.
Below is the code I have which works, but there has to be a more efficient way to do this.
tmp1 = p_pricing.replace("from ", "")
tmp1 = tmp1.replace("Options Available on Open Box", "")
tmp1 = tmp1.replace("Open Box Price: From ", "")
tmp1 = re.sub(r'\([^)]*\)', '', tmp1)
tmp1 = re.split("[$]", tmp1)
Below is a small sample of my pricing string:
$11.99($6.00 per item)$14.99
from $13.99$18.25
$9.89($4.94 per item)$14.99
from $9.83($3.28 per item)
from $15.99$29.99
from $84.99$104.95
from $9.83($3.28 per item)
$3.47
$94.99$129.99
from $14.34$19.90
from $25.01$65.00Options Available on Open Box
It seems you just want to get the numeric values of all prices in each string.
You can use
re.findall(r'\$(\d+(?:\.\d+)?)', text)
See the regex demo.
Details
\$ - a $ char
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits, and then an optional occurrence of a . and one or more digits.
See the Python demo:
import re
pattern = r"\$(\d+(?:\.\d+)?)"
text = "$11.99($6.00 per item)$14.99\nfrom $13.99$18.25\n$9.89($4.94 per item)$14.99\nfrom $9.83($3.28 per item) \nfrom $15.99$29.99\nfrom $84.99$104.95\nfrom $9.83($3.28 per item) \n$3.47\n$94.99$129.99\nfrom $14.34$19.90\nfrom $25.01$65.00Options Available on Open Box"
print( re.findall(pattern, text) )
Output:
['11.99', '6.00', '14.99', '13.99', '18.25', '9.89', '4.94', '14.99', '9.83', '3.28', '15.99', '29.99', '84.99', '104.95', '9.83', '3.28', '3.47', '94.99', '129.99', '14.34', '19.90', '25.01', '65.00']
Your code already removes everything from an opening parenthesis to the closing one by replacing \([^)]*\) with an empty string, so the prices you want are the ones outside the parentheses.
To get them in one pass, match each parenthesised part without capturing it, then use an alternation | and capture what you want to keep.
The digits are in capture group 1; when the parenthesised alternative matches, group 1 is empty.
\([^()]*\)|\$(\d+(?:\.\d+))
\([^()]*\) Match from an opening till closing parenthesis
| Or
\$ Match a dollar sign
(\d+(?:\.\d+)) Capture group 1: match 1+ digits followed by a decimal part (write the inner group as (?:\.\d+)? if the decimal part should be optional)
See a regex demo or a Python demo
Example code
import re
pattern = r"\([^()]*\)|\$(\d+(?:\.\d+))"
s = "$11.99($6.00 per item)$14.99 from $13.99$18.25 $9.89($4.94 per item)$14.99 from $9.83($3.28 per item) from $15.99$29.99 from $84.99$104.95 from $9.83($3.28 per item) $3.47 $94.99$129.99 from $14.34$19.90 from $25.01$65.00Options Available on Open Box"
print([m for m in re.findall(pattern, s) if m])
Output
['11.99', '14.99', '13.99', '18.25', '9.89', '14.99', '9.83', '15.99', '29.99', '84.99', '104.95', '9.83', '3.47', '94.99', '129.99', '14.34', '19.90', '25.01', '65.00']
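If you need the prices per pricing string rather than as one flat list, the same "match what you discard, capture what you keep" idea can be applied line by line. This is a sketch under two assumptions of mine: the helper name prices_outside_parens is invented for illustration, and the decimal part is made optional with a trailing ?.

```python
import re

# Assumption: "?" added after the inner group so prices like "$5" also match
PRICE_RE = re.compile(r"\([^()]*\)|\$(\d+(?:\.\d+)?)")

def prices_outside_parens(line):
    """Return the prices on one line as floats, skipping anything in parentheses."""
    # group(1) is None whenever the parenthesised alternative matched
    return [float(m.group(1)) for m in PRICE_RE.finditer(line) if m.group(1)]

lines = [
    "$11.99($6.00 per item)$14.99",
    "from $9.83($3.28 per item)",
    "from $25.01$65.00Options Available on Open Box",
    "$3.47",
]
for line in lines:
    print(prices_outside_parens(line))
```

Using finditer and checking group(1) avoids filtering empty strings out of the findall result afterwards.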

Regex to match "words" that contain two continuous streaks of digits and letters or vice-versa and split them

I have the following line of text:
text= 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
I am trying to split numbers followed by letters, or letters followed by numbers, to get this output:
output_text = 'Cms 12345678 Gleandaleacademy Fee Collection 00001234 Abcd Renewal 123Acgf456789'
I have tried the following approach:
import re
text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
text = text.lower().strip()
text = text.split(' ')
output_text =[]
for i in text:
    if bool(re.match(r'[a-z]+\d+|\d+\w+',i, re.IGNORECASE))==True:
        out_split = re.split('(\d+)',i)
        for j in out_split:
            output_text.append(j)
    else:
        output_text.append(i)
output_text = ' '.join(output_text)
Which is giving output as:
output_text = 'cms 12345678 gleandaleacademy fee collection 00001234 abcd renewal 123 acgf 456789 '
This code also splits the last element of text, 123acgf456789, because of the incorrect regex in re.match.
Please help me get the correct output.
You can use
re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text)
See the regex demo
Details
\b - word boundary
(?: - start of a non-capturing group (necessary for the word boundaries to be applied to all the alternatives):
([a-zA-Z]+)(\d+) - Group 1: one or more letters and Group 2: one or more digits
| - or
(\d+)([a-zA-Z]+) - Group 3: one or more digits and Group 4: one or more letters
) - end of the group
\b - word boundary
During the replacement, only one pair of groups has taken part in the match (either \1 and \2, or \3 and \4) and the other pair is empty, so concatenating them as \1\3 and \2\4 yields the right results.
See a Python demo:
import re
text = "Cms1291682971 Gleandaleacademy Fee Collecti 0000548Andb Renewal 402Ecfev845410001"
print( re.sub(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b', r'\1\3 \2\4', text) )
# => Cms 1291682971 Gleandaleacademy Fee Collecti 0000548 Andb Renewal 402Ecfev845410001
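The \1\3 \2\4 replacement relies on unmatched groups being substituted as empty strings, which is how re.sub behaves in Python 3.5 and later. If you prefer to make that explicit, a replacement function is one alternative. The function name split_token below is just an illustrative choice.

```python
import re

PATTERN = re.compile(r'\b(?:([a-zA-Z]+)(\d+)|(\d+)([a-zA-Z]+))\b')

def split_token(m):
    # Only one alternative matched, so exactly two of the four groups are set
    first = m.group(1) or m.group(3)
    second = m.group(2) or m.group(4)
    return f"{first} {second}"

text = 'Cms12345678 Gleandaleacademy Fee Collection 00001234Abcd Renewal 123Acgf456789'
result = PATTERN.sub(split_token, text)
print(result)
```

The last token is left alone because it has three streaks (digits, letters, digits), so neither alternative can match it between two word boundaries.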

Match names, dialogues, and actions from transcript using regex

Given a string dialogue such as below, I need to find the sentence that corresponds to each user.
text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''
For the above dialogue, I would like to return tuples with three elements with:
The name of the person
The sentence in lower case and
The sentences within Brackets
Something like this:
('CHRIS', 'Hello, how are you...', None)
('PETER', 'Great, you?', None)
('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')
('CHRIS', 'Are you ok?', None)
etc...
I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.
actors = re.findall(r'\w+(?=\s*:[^/])',text)
You can do this with re.findall:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
You will have to remove the square brackets yourself in a separate step; that cannot be done by the same regex while it is still matching everything else.
Regex Breakdown
\b            # Word boundary
(\S+)         # First capture group: a run of non-whitespace characters
:             # Colon
(             # Second capture group
    [^        # Match anything that is not...
        :     # a colon
        \[\]  # or square brackets
    ]+?       # Non-greedy match
)
\n?           # Optional newline
(             # Third capture group
    \[        # Literal opening bracket
    [^:]+?    # Similar to above - exclude colon from match
    \]        # Literal closing bracket
    \n?       # Optional newline
)?            # Third capture group is optional
(?=           # Lookahead for...
    \b        # a word boundary, followed by
    \S+       # one or more non-space chars, and
    :         # a colon
    |         # Or,
    $         # end of the string
)
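One possible way to do the bracket cleanup as a second step is sketched below. It reuses the pattern from this answer, then pulls the text out of each [...] pair with another findall; joining the actions with ". " and using None for missing actions are my guesses at the format the question asks for.

```python
import re

text = """CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?"""

pattern = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'

result = []
for name, sentence, actions in re.findall(pattern, text):
    # Extract the text inside each [...] pair of the third group
    inner = re.findall(r'\[([^\]]*)\]', actions)
    # An empty join result becomes None, matching the wanted output
    result.append((name, sentence.strip(), '. '.join(inner) or None))

for row in result:
    print(row)
```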
Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.
For example, we could first find groups of names and text:
from itertools import groupby
def isName(word):
    # Names end with ':'
    return word.endswith(":")
text_split = [
    " ".join(list(g)).rstrip(":")
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
Next you can collect pairs of consecutive elements in text_split into tuples:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions are admittedly an option here, but I'm purposely avoiding them in this answer.)
Here's something quick that I came up with:
def isClosingBracket(word):
    return word.endswith("]")
def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".")
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]
print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
Note that using * to unpack the result of processWords inside a tuple display requires Python 3.5 or later (PEP 448).

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or fewer immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length or content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
(?:[A-Z][a-z]+ ?){1,3} - capitalized word repeated 1 to 3 times, separated by spaces
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (rest of the word)
_? - optional space separator between capitalized words (_ stands for a space); the ? allows a single word without a trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
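In Python, this pattern can be used with finditer to collect Group 2 from every match. The snippet below is a sketch on a shortened sample of my own, not the full sentence from the question.

```python
import re

# Shortened sample for illustration
txt = 'Saul "Canelo" Alvarez. \'Money Mayweather\'. "Here Goes a String". "This IS a" "Three Word String".'

# Triple-quoted raw string so both quote characters can appear unescaped
pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Group 2 holds the text between the quotes
names = [m.group(2) for m in pattern.finditer(txt)]
print(names)
```

One thing to be aware of: because the space after each word is optional, a quoted string with a trailing space before the closing quote would also match, with the space included in Group 2.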
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

regex to parse out certain value that i want

Using https://regex101.com/
My current regex: ^.*'(\d\s*.*)'*$
It doesn't seem to be working. What is the right pattern to use?
I want to be able to parse out 4 variables, namely
items, quantity, cost and total
MY CODE:
import re
str = "xxxxxxxxxxxxxxxxxx"
match = re.match(r"^.*'(\d\s*.*)'*$",str)
print match.group(1)
The following regex matches each ingredient string and stores the wanted information in groups: r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$'
It defines 3 groups, separated from each other by whitespace:
^ marks the string start
(\d+) is the first group and looks for at least one digit
\s+ is the first separation between groups and looks for at least one whitespace character
([A-Za-z ]+) is the second group and looks for at least one alphabetical character or space
\s+ is the second separation between groups and looks for at least one whitespace character
(\d+(?:\.\d*)) is the third group and looks for at least one digit followed by a decimal point and zero or more further digits (as written, the decimal point is required)
$ marks the string end
The regex for the total is simple enough that it should not need explaining.
Here is test code using your test data. It should be a good starting point:
import re
TEST_DATA = ['Table: Waiter: kenny',
             '======================================',
             '1 SAUSAGE WRAPPED WITH B 10.00',
             '1 ESCARGOT WITH GARLIC H 12.00',
             '1 PAN SEARED FOIE GRAS 15.00',
             '1 SAUTE FIELD MUSHROOM W 9.00',
             '1 CRISPY CHICKEN WINGS 7.00',
             '1 ONION RINGS 6.00',
             '----------------------------------',
             'TOTAL 59.00',
             'CASH 59.00',
             'CHANGE 0.00',
             'Signature:__________________________',
             'Thank you & see you again soon!']
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')
ingredients = []
total = None
for string in TEST_DATA:
    match = INGREDIENT_RE.match(string)
    if match:
        ingredients.append(match.groups())
        continue
    match = TOTAL_RE.match(string)
    if match:
        total = match.groups()[0]
        break
print(ingredients)
print(total)
this prints:
[('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'), ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'), ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
59.00
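Since the groups come back as strings, you will probably want to convert them before using them. The following is a rough sketch of that step, under a few assumptions of mine: the two-item receipt is made up for the example, the right-hand column is treated as the line amount, and the dictionary keys item, quantity and cost are just illustrative names.

```python
import re
from decimal import Decimal

INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')

# Hypothetical, shortened receipt in the same layout as the test data
receipt = [
    '1 CRISPY CHICKEN WINGS 7.00',
    '1 ONION RINGS 6.00',
    '----------------------------------',
    'TOTAL 13.00',
]

items = []
total = None
for line in receipt:
    m = INGREDIENT_RE.match(line)
    if m:
        qty, name, cost = m.groups()
        # int for the quantity, Decimal for money to avoid float rounding
        items.append({'item': name, 'quantity': int(qty), 'cost': Decimal(cost)})
        continue
    m = TOTAL_RE.match(line)
    if m:
        total = Decimal(m.group(1))

# Assumption: the cost column already holds the amount for the whole line
computed = sum(entry['cost'] for entry in items)
print(items)
print(total, computed, total == computed)
```

Comparing the parsed total with the sum of the parsed lines is a cheap sanity check that no item line was missed by the regex.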
Edit on Python raw strings:
The r character before a Python string literal indicates that it is a raw string, which means that escape sequences (like \t, \n, etc.) are not interpreted.
For example, in a standard string \t is one tab character. In a raw string it is two characters: \ and t.
r'\t' is equivalent to '\\t'.
more details in the doc
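A quick way to convince yourself of the difference is to compare the two spellings directly. The variable names here are arbitrary.

```python
import re

plain = '\t'    # escape sequence interpreted: one TAB character
raw = r'\t'     # raw string: a backslash followed by the letter t

print(len(plain), len(raw))   # the lengths differ
print(raw == '\\t')           # same two characters, spelled without the r prefix

# Both spellings hand the regex engine the same pattern \d+
same = re.findall(r'\d+', 'a1b22') == re.findall('\\d+', 'a1b22')
print(same)
```

This is why regex patterns are usually written as raw strings: the backslashes reach the regex engine unchanged, without having to double them.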
