Grouping data with a regex in Python

I have some raw data like this:
Dear John Buy 1 of Coke, cost 10 dollars
Ivan Buy 20 of Milk
Dear Tina Buy 10 of Coke, cost 100 dollars
Mary Buy 5 of Milk
The rules for the data are:
Not every line starts with "Dear", but if it does, the line must end with the cost.
The item is not always a normal word; it can be any string (letters, digits, etc.).
I want to group the information, and I tried to use a regex. This is what I tried:
for line in file.readlines():
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)', line)
    if match is not None:
        print(match.groups())
file.close()
Now the output looks like:
('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')
The output above is what I want. However, if the item is replaced by some unusual string like A1~A10, some of the outputs get the wrong info:
('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')
I think the one constant in the item field is that it always ends with a comma (when a cost follows), but I don't know how to take advantage of that.
Although the code above works for the simple case, I think (?P<item>\w+) has to be replaced with something like (?P<item>.+). If I do that, it captures too much of the line in the tuple:
('John', '1', 'Coke, cost 10 dollars', '')
How could I read the data into the format I want by using the regex in Python?

I have tried this regular expression:
^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?
Explanation
^(Dear)? matches an optional Dear at the start of the line
(?P<name>\w*) is a named capture group for the name
\D* matches any run of non-digit characters
(?P<num>\d+) is a named capture group for the number
\sof\s matches the word of with whitespace on either side
(?P<drink>\w*) captures the drink
(,\D*(?P<cost>\d+)\D*)? is an optional group that captures the cost of the drink
Compile it with the named groups, so that the groups can be looked up by name:
>>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')
First data snippet
>>> data1 = 'Dear John Buy 1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> match_object.group('name', 'num', 'drink', 'cost')
('John', '1', 'Coke', '10')
Second data snippet
>>> data2 = ' Ivan Buy 20 of Milk'
>>> match_object = reobject.search(data2)
>>> match_object.group('name', 'num', 'drink', 'cost')
('Ivan', '20', 'Milk', None)

Without regex:
with open('commandes.txt') as f:
    results = []
    for line in f:
        # Split on whitespace at most 5 times; the 6th part keeps its spaces.
        parts = line.split(None, 5)
        price = ''
        if parts[0] == 'Dear':
            # "item, cost N dollars" -> item before the first comma, cost after it
            tmp = parts[5].split(',', 1)
            for tok in tmp[1].split():
                if tok.isnumeric():
                    price = tok
                    break
            results.append((parts[1], parts[3], tmp[0], price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], price))
print(results)
It doesn't care which characters are used, other than whitespace, up to the product name; that's why each line is split on whitespace at most 5 times, giving up to 6 parts. When the line starts with "Dear", the last part is split at the first comma to extract the product name and the price. Note that if the price is always in the same place (i.e. right after "cost"), you can drop the innermost for loop and replace it with price = tmp[1].split()[1]
Note: if you want to prevent empty lines from being processed, you can change the first for loop to:
for line in (x for x in f if x.rstrip()):

I would use this regex:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
Demo
>>> line = 'Dear Tina Buy 10 of A1~A10'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', None)
>>> line = 'Dear Tina Buy 10 of A1~A10, cost 100 dollars'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', '100')
Explanation
The first section of your regex is perfectly fine; here's the tricky part:
(?P<item>[^,]+) Since the string always contains a comma when the cost is present, we take anything but a comma as the item value.
(?:,\D+)?(?P<costs>\d+)? Here we're using two groups. The important thing is the ? after the closing parenthesis of each group. From the Python documentation:
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
So we use ? to cover both possibilities (cost string present or not).
(?:,\D+) is a non-capturing group that matches a comma followed by one or more non-digit characters.
(?P<costs>\d+) captures one or more digits in the named group costs.
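As a minimal sketch, the pattern above can be run over sample lines modelled on the question. The parse_line helper and the lines list are illustrative names that are not part of the original answer, and the leading space on the lines without "Dear" is an assumption based on the question's own pattern, which requires whitespace before the name.

```python
import re

# Pattern from the answer above: the item is "anything but a comma",
# and the ", cost N dollars" tail is optional.
PATTERN = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)'
    r'(?:,\D+)?(?P<costs>\d+)?'
)

def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None

# Hypothetical sample lines (assumed leading space where "Dear" is absent).
lines = [
    'Dear John Buy 1 of Coke, cost 10 dollars',
    ' Ivan Buy 20 of A1~A10',
    'Dear Tina Buy 10 of A1~A10, cost 100 dollars',
    ' Mary Buy 5 of Milk',
]

for line in lines:
    print(parse_line(line))
```

Using groupdict() instead of groups() is only a readability choice here; a missing cost shows up as None under the costs key.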

If you use .+, the subpattern grabs the whole rest of the line, because . matches any character except a newline (unless the re.S flag is set).
You can replace \w+ with the negated character class [^,]+ to match one or more characters other than a comma:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)'
                                                ^^^^^
Demo:
import re
file = "Dear John Buy 1 of A1~A10, cost 10 dollars\n Ivan Buy 20 of Milk\nDear Tina Buy 10 of Coke, cost 100 dollars\n Mary Buy 5 of Milk"
for line in file.split("\n"):
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)', line)
    if match:
        print(match.groups())
Output:
('John', '1', 'A1~A10', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')

Related

Regex pattern to match multiple characters and split

I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]

Don't use alternatives. Put the name and number patterns one after the other in a single pattern, and add a third group that captures the rest of the line.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
all = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(all, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
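If the free text should instead run up to the next **Name**, even across line breaks (so that the trailing URL stays with the last note), one possible variation is a lazy group bounded by a lookahead. This is a sketch under that assumption rather than the pattern from the answer above; NOTE_RE and entries are illustrative names.

```python
import re

note = ("**Jane Greiz** `#1`: Should be open here .\n"
        "**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n"
        "**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com")

# Group 1: name between the ** markers.
# Group 2: digits after `#.
# Group 3: lazily everything up to a newline followed by the next "**",
#          or up to the end of the string (re.S lets . cross newlines).
NOTE_RE = re.compile(r"\*\*(.+?)\*\*\s*`#(\d+)`:\s*(.*?)(?=\n\*\*|\Z)", re.S)

entries = NOTE_RE.findall(note)
for entry in entries:
    print(entry)
```

The lookahead does not consume the next "**", so the following match can start there.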

How can I extract some patterns of sub text from a gibberish looking text using regular expressions?

I have some text where the age and gender of a person is mentioned in some of the records (not all) as 28M, or 35 F, or 29 male, or 57Female, etc.
I wrote the following regular expression to check if there is any pattern that matches a number followed by a M in an input string, and if yes to print it out, but the code does not print anything:
import re
text = 'Decision: Standard\r\n\r\n 36M NS\r\nBasic - 500th MP tdb addd cib 250th\r\n\r\nDue Date: Settlement date'
test_search = re.search('[0-9]+M', text)
if test_search:
    print("Age: "+test_search.group(0)+", Gender: "+test_search.group(1))
I expected it to have printed Age: 36, Gender: M. However, it does nothing - no error, no output, nothing.
I tried re.match('[0-9]+F', text), nothing happened there either.
Also, I thought I have to write as many regular expressions as there are patterns (one each for 28M, 35 F, 29Male, 57 female, etc). Is that the correct approach? Or is there a way to search/find/match all of these patterns at once?

You may use this regex to match all the cases you have mentioned in the question:
results = re.findall(r'(?i)(\d+)\s*([mf]|(?:fe)?male)\b', text)
Details:
(?i): Ignore case modifier
(\d+): Match and capture 1+ digits in group #1
\s*: Match 0 or more whitespace characters
([mf]|(?:fe)?male): Match and capture M or F or male or female in group #2
\b: word boundary
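A small sketch of how that pattern might be used on the question's text. The extract_age_gender helper and the second sample string are illustrative, and normalising the gender to a single upper-case letter is an assumption about the desired output.

```python
import re

# Pattern from the answer above; (?i) makes it case-insensitive.
AGE_GENDER_RE = re.compile(r'(?i)(\d+)\s*([mf]|(?:fe)?male)\b')

def extract_age_gender(text):
    """Return (age, gender) pairs with gender normalised to 'M' or 'F'."""
    return [(int(age), gender[0].upper())
            for age, gender in AGE_GENDER_RE.findall(text)]

text = ('Decision: Standard\r\n\r\n 36M NS\r\nBasic - 500th MP tdb addd '
        'cib 250th\r\n\r\nDue Date: Settlement date')

print(extract_age_gender(text))
print(extract_age_gender('28M, 35 F, 29 male, 57Female'))
```

The word boundary is what keeps 500th and 250th from matching: the digits are followed by a letter that is neither m nor f.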

You can use this regex
([0-9]+)\s?(male|female|m|f)\b with the re.IGNORECASE flag, and capture the age and gender in separate capturing groups. (Square brackets, as in [M|Male|F|Female], would define a character class, that is, a set of single characters rather than a list of alternative words.)
Python demo:
import re
test_str = r"Decision: Standard\\r\\n\\r\\n 36M NS\\r\\nBasic - 500th MP tdb addd cib 250th\\r\\n\\r\\nDue Date: Settlement date 29 male 57Female 35 F"
pattern = r"([0-9]+)\s?(male|female|m|f)\b"
def return_gender_dict(match_obj):
    # match_obj is the (age, gender) tuple that findall yields for each match
    return { 'age': match_obj[0], 'gender': match_obj[1][0].upper() }
matches = re.findall(pattern, test_str, flags=re.MULTILINE | re.IGNORECASE)
result = [return_gender_dict(match) for match in matches]
print(result)
Outputting:
[{'age': '36', 'gender': 'M'}, {'age': '29', 'gender': 'M'}, {'age': '57', 'gender': 'F'}, {'age': '35', 'gender': 'F'}]

Try the following regex (the optional space is factored out, and the longer alternatives come first so that Male is not cut short to M):
(\d\d)\s?(Male|Female|M|F)

regular expression for the extracting multiple patterns

I have string like this
string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)
Output
('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
('General Info.', '[PA Number:', ' R180126187]')
('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')
Wanted output:
('Claim Status', 'Primary Status:Paidup to Rebilled')
('General Info.', 'PA Number:R180126187')
('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')

You may achieve what you need with a solution like this:
import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)
See the Python online demo. Output:
[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]
With the ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex, you match two consecutive lines with the second being optional (due to the (?:...)? optional non-capturing group), the first is captured into Group 1 and the subsequent one (that starts with [ and ends with ]) is captured into Group 2. (Note that \r?$ is necessary since in the multiline mode $ only matches before a newline and not a carriage return.) Group 1 value is added to a temporary list, then the contents of the second group is split with ], [ (if you are not sure about the amount of whitespace, you may use re.split(r']\s*,\s*\[', m.group(2))) and then only add those items that contain a : in them to the temporary list.

You are getting three elements per result because you are using "capturing" regular expressions. Rewrite your regexp like this to combine the second and third match:
re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)
A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.

regex to parse out certain value that i want

Using https://regex101.com/
My current regex: ^.*'(\d\s*.*)'*$
It doesn't seem to be working. What is the right pattern to use?
I want to be able to parse out 4 values, namely
items, quantity, cost and total.
My code:
import re
str = "xxxxxxxxxxxxxxxxxx"
match = re.match(r"^.*'(\d\s*.*)'*$",str)
print match.group(1)

The following regex matches each ingredient line and stores the wanted information in groups: r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$'
It defines 3 groups, separated from each other by whitespace:
^ marks the start of the string
(\d+) is the first group and looks for at least one digit
\s+ is the first separator between groups and looks for at least one whitespace character
([A-Za-z ]+) is the second group and looks for at least one alphabetical character or space
\s+ is the second separator between groups and looks for at least one whitespace character
(\d+(?:\.\d*)) is the third group and looks for at least one digit followed by a decimal point and, optionally, more digits
$ marks the end of the string
The regex that obtains the total should not need an explanation.
Here is some test code using your test data. It should be a good starting point:
import re
TEST_DATA = ['Table: Waiter: kenny',
'======================================',
'1 SAUSAGE WRAPPED WITH B 10.00',
'1 ESCARGOT WITH GARLIC H 12.00',
'1 PAN SEARED FOIE GRAS 15.00',
'1 SAUTE FIELD MUSHROOM W 9.00',
'1 CRISPY CHICKEN WINGS 7.00',
'1 ONION RINGS 6.00',
'----------------------------------',
'TOTAL 59.00',
'CASH 59.00',
'CHANGE 0.00',
'Signature:__________________________',
'Thank you & see you again soon!']
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')
ingredients = []
total = None
for string in TEST_DATA:
    match = INGREDIENT_RE.match(string)
    if match:
        ingredients.append(match.groups())
        continue
    match = TOTAL_RE.match(string)
    if match:
        total = match.groups()[0]
        break
print(ingredients)
print(total)
this prints:
[('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'), ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'), ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
59.00
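Since every captured value is a string, a possible next step is to convert the fields and check them against the TOTAL line. The sketch below reuses the shape of the pattern above with named groups added for readability; ITEM_RE, the group names and the receipt list are illustrative, and treating the last column as the line total (rather than a unit price) is an assumption.

```python
import re
from decimal import Decimal

# Same shape as the pattern above, with hypothetical group names.
ITEM_RE = re.compile(r'^(?P<qty>\d+)\s+(?P<name>[A-Za-z ]+)\s+(?P<cost>\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL\s+(?P<total>\d+(?:\.\d*))$')

# Item and total lines taken from the test data above.
receipt = [
    '1 SAUSAGE WRAPPED WITH B 10.00',
    '1 ESCARGOT WITH GARLIC H 12.00',
    '1 PAN SEARED FOIE GRAS 15.00',
    '1 SAUTE FIELD MUSHROOM W 9.00',
    '1 CRISPY CHICKEN WINGS 7.00',
    '1 ONION RINGS 6.00',
    '----------------------------------',
    'TOTAL 59.00',
]

items = []
total = None
for line in receipt:
    m = ITEM_RE.match(line)
    if m:
        # Convert quantity to int and cost to Decimal (exact for money).
        items.append((int(m.group('qty')), m.group('name'), Decimal(m.group('cost'))))
        continue
    m = TOTAL_RE.match(line)
    if m:
        total = Decimal(m.group('total'))

# Assumption: each cost is already the line total, so a plain sum is enough.
computed = sum(cost for _, _, cost in items)
print(items)
print(total, computed == total)
```

Decimal is used instead of float so that the comparison with the printed total is exact.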
Edit on Python raw strings:
The r prefix before a Python string literal marks it as a raw string, which means that backslash escape sequences (like \t, \n, etc.) are not interpreted.
For example, in a standard string \t is a single tab character. In a raw string it is two characters: \ and t.
r'\t' is equivalent to '\\t'.
More details are in the Python documentation.
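The difference can be checked directly. This is a small illustration with made-up variable names:

```python
import re

# In a normal string literal, \t is a single tab character.
normal = '\t'
# In a raw string literal the backslash is kept: two characters, \ and t.
raw = r'\t'

print(len(normal), len(raw))
print(raw == '\\t')

# Both spellings hand the regex engine the same pattern text (backslash + d +).
digits_raw = re.findall(r'\d+', 'a1b22')
digits_escaped = re.findall('\\d+', 'a1b22')
print(digits_raw == digits_escaped)
```

Raw strings are mostly a convenience for regexes: they avoid doubling every backslash.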

Python regular expression (regex) match comma separated number - why does this not work?

I am trying to parse transaction letters from my (German) bank.
I'd like to extract all the numbers from the following string which turns out to be harder than I thought.
Option 2 does almost what I want. I now want to modify it to capture e.g. 80 as well.
My first try is option 1 which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?
Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.
# -*- coding: utf-8 -*-
import re
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
print re.findall(r'\d+(,\d+)?', my_str)
print re.findall(r'\d+,\d+', my_str)
print re.findall(r'[-+]?\d*,\d+|\d+', my_str)
Output is
['', '', '', '', '', '', ',98', '', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']

Option 1 is the most suitable of the three regexes, but it does not work as expected because, when the pattern contains a capture group (), findall returns what the group matched rather than the complete match.
For example, the first three matches in your example are 18, 04 and 2013; in each case the capture group does not take part in the match, so an empty string is added to the results list.
The solution is to make the group non-capturing:
r'\d+(?:,\d+)?'
Option 2 fails only in that it won't match numbers that don't contain a comma.
Option 3 isn't great because it will match e.g. +,1.
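The findall behaviour described above can be seen on a short string. The sample below is illustrative and not taken from the bank letter; the finditer variant is an alternative for cases where the capturing group has to stay.

```python
import re

# Illustrative sample: a dotted date, a comma decimal and a plain integer.
sample = '18.04.2013 0,9800 EUR 80'

# With a capturing group, findall returns only what the group matched.
capturing = re.findall(r'\d+(,\d+)?', sample)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\d+(?:,\d+)?', sample)

# Alternative: keep the capturing group and ask finditer for group(0).
whole_matches = [m.group(0) for m in re.finditer(r'\d+(,\d+)?', sample)]

print(capturing)
print(non_capturing)
print(whole_matches)
```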

I'd like to extract all the numbers from the following string ...
By "numbers", if you mean both the currency amounts AND the dates, I think that this will do what you want:
print re.findall(r'[0-9][0-9,.]+', my_str)
Output:
['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']
If by "numbers" you mean only the currency amounts, then use
print re.findall(r'[0-9]+,[0-9]+', my_str)
Or perhaps better yet,
print re.findall(r'[0-9]+,[0-9]+ EUR', my_str)

Here is a solution which parses the statement and puts the result in a dictionary called bank_statement:
# -*- coding: utf-8 -*-
import itertools
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
bank_statement = {}
for line in my_str.split('\n'):
    tokens = line.split()
    #print tokens
    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            category = category.strip(' *')
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token
# bank_statement now has all the values
print '\n'.join('{0:.<18} {1}'.format(k, v) \
    for k, v in sorted(bank_statement.items()))
The Output of this code:
Bruttodividende... 78,40
Depotinhaber...... ME
Einbeh. Steuer.... 20,67
Endbetrag......... 57,73
Extag............. 18.04.2013
Nettodividende.... 78,40
Valuta............ 18.04.2013
Zahlungstag....... 18.04.2013
pro Stück........ 0,9800
Discussion
The code scans the statement string line by line.
It then breaks each line into tokens.
It scans through the tokens looking for a colon. When one is found, the part before the colon becomes the category and the token after it becomes the value. bank_statement['Extag'], for example, has the value '18.04.2013'.
Please note that all the values are strings, not numbers, but it is trivial to convert them.
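One way the conversion might look, as a sketch: to_decimal and to_date are hypothetical helpers, the dictionary is a hand-copied subset of the output above, and the code assumes that amounts use a decimal comma with no thousands separators.

```python
from datetime import datetime
from decimal import Decimal

def to_decimal(text):
    """Convert a German-style amount such as '78,40' to Decimal('78.40')."""
    # Assumes a decimal comma and no thousands separators.
    return Decimal(text.replace(',', '.'))

def to_date(text):
    """Convert a date such as '18.04.2013' to a datetime.date."""
    return datetime.strptime(text, '%d.%m.%Y').date()

# Subset of the values parsed above, still as strings.
bank_statement = {
    'Bruttodividende': '78,40',
    'Einbeh. Steuer': '20,67',
    'Endbetrag': '57,73',
    'Extag': '18.04.2013',
}

gross = to_decimal(bank_statement['Bruttodividende'])
tax = to_decimal(bank_statement['Einbeh. Steuer'])
net = to_decimal(bank_statement['Endbetrag'])

print(gross - tax == net)
print(to_date(bank_statement['Extag']).isoformat())
```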

Try this one:
re.findall(r'\d+(?:[\d,.]*\d)', my_str)
This regex requires that the match starts with a digit, continues with any mix of digits, commas and periods, and ends with a digit too, so it needs at least two digits to match.

This question is relevant; the following
print re.findall(r'\d+(?:,\d+)?', my_str)
                       ^^
outputs
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Excluding the "dotted" numbers is a little more complicated:
print re.findall(r'(?<!\d\.)\b\d+(?:,\d+)?\b(?!\.\d)', my_str)
                   ^^^^^^^^^^^            ^^^^^^^^^^
This outputs
['0,9800', '78,40', '20,67', '78,40', '57,73']

Option 2 doesn't match numbers like '18.04.2013' because you are matching '\d+,\d+', which means
digit (one or more) comma digit (one or more)
For parsing the numbers in your case I'd use
\s(\d+[^\s]+)
which translates to
space (get digit [one or more] get everything != space [one or more])
space = \s
get digit = \d
one or more = + (so it becomes \d+)
get everything != space = [^\s]
one or more = + (so it becomes [^\s]+)

My solution without a non-capturing group:
re.findall(r'(?<!\.)\b\d+[,\d]*\d\b(?!\.)', my_str)
(?<!\.) - Check that there is no . before the number (to avoid '2013' in the date 18.04.2013)
\b\d+ - the number should start at a word boundary
[,\d]* - allow commas or digits
\d\b - the number should end with a digit at a word boundary
(?!\.) - Check that there is no . after the number (to avoid '18' or '04' in the date 18.04.2013)
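A quick check of this pattern on a short line in the style of the bank letter. The sample strings are illustrative; the last two calls show a side effect of \d+ followed by \d, namely that a number needs at least two digits to match.

```python
import re

# Pattern from the answer above: digits and commas, not adjacent to a dot.
NUMBER_RE = re.compile(r'(?<!\.)\b\d+[,\d]*\d\b(?!\.)')

sample = 'Valuta : 18.04.2013 Bruttodividende : 78,40 EUR'

amounts = NUMBER_RE.findall(sample)
print(amounts)

# Two digits match, a single digit does not.
print(NUMBER_RE.findall('80 EUR'))
print(NUMBER_RE.findall('5 EUR'))
```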
