I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]
Don't use alternation. Put the name and number patterns one after the other in a single pattern, and add another group for the rest of the line (the text before the next **).
import re

note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
# Named pattern rather than all, so the built-in all() is not shadowed
pattern = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(pattern, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
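Note that the pattern above stops each message at the end of its line, so the trailing URL is dropped. If a message may run over several lines, one possible variation (a sketch only, not part of the original answer) is to let the third group run lazily up to the next line that starts with ** or up to the end of the text. The name NOTE_RE and the use of re.DOTALL and \d+ for the number are assumptions made for this example.

```python
import re

note = ("**Jane Greiz** `#1`: Should be open here .\n"
        "**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n"
        "**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com")

# Hypothetical pattern: name, number, then everything (newlines included,
# because of re.DOTALL) up to the next "\n**" or the end of the string.
NOTE_RE = re.compile(r"\*\*(.+?)\*\*\s*`#(\d+)`:\s*(.*?)(?=\n\*\*|\Z)", re.DOTALL)

entries = NOTE_RE.findall(note)
for entry in entries:
    print(entry)
```

With this variation the last tuple keeps the URL line as part of Anthony Smith's message, and the leading space of each message is consumed by \s* instead of being captured.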
I have some text where the age and gender of a person is mentioned in some of the records (not all) as 28M, or 35 F, or 29 male, or 57Female, etc.
I wrote the following regular expression to check if there is any pattern that matches a number followed by a M in an input string, and if yes to print it out, but the code does not print anything:
import re
text = 'Decision: Standard\r\n\r\n 36M NS\r\nBasic - 500th MP tdb addd cib 250th\r\n\r\nDue Date: Settlement date'
test_search = re.search('[0-9]+M', text)
if test_search:
    print("Age: "+test_search.group(0)+", Gender: "+test_search.group(1))
I expected it to have printed Age: 36, Gender: M. However, it does nothing - no error, no output, nothing.
I tried re.match('[0-9]+F', text), nothing happened there either.
Also, I thought I have to write as many regular expressions as there are patterns (one each for 28M, 35 F, 29Male, 57 female, etc). Is that the correct approach? Or is there a way to search/find/match all of these patterns at once?
You may use this regex to match all the cases you have mentioned in question:
results = re.findall(r'(?i)(\d+)\s*([mf]|(?:fe)?male)\b', text)
Details:
(?i): Ignore case modifier
(\d+): Match and capture 1+ digits in group #1
\s*: Match 0 or more whitespaces
([mf]|(?:fe)?male): Match and capture M, F, male or female in group #2
\b: word boundary
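As a quick check of the breakdown above, here is a small sketch that runs the same regex over the four spellings mentioned in the question and normalises the result. The names AGE_GENDER_RE, samples and pairs are made up for this example, and reducing the gender to its first letter is an assumption about what the caller wants.

```python
import re

AGE_GENDER_RE = re.compile(r'(?i)(\d+)\s*([mf]|(?:fe)?male)\b')

# The four spellings listed in the question
samples = "28M, 35 F, 29 male, 57Female"

# Convert the age to int and keep only the first letter of the gender
pairs = [(int(age), gender[0].upper())
         for age, gender in AGE_GENDER_RE.findall(samples)]
print(pairs)

# The text from the question: only "36M" qualifies, "500th" and "250th" do not
text = 'Decision: Standard\r\n\r\n 36M NS\r\nBasic - 500th MP tdb addd cib 250th\r\n\r\nDue Date: Settlement date'
print(AGE_GENDER_RE.findall(text))
```

A single pattern therefore covers all the variants; there is no need for one regex per spelling.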
You can use this regex
([0-9]+)\s?(male|female|m|f)\b
and capture the age and gender in separate capturing groups. Note that the alternatives belong in a group, not in square brackets: [M|Male|F|Female]+ is a character class that matches any run of the characters M, a, l, e, F and |.
import re

test_str = 'Decision: Standard\r\n\r\n 36M NS\r\nBasic - 500th MP tdb addd cib 250th\r\n\r\nDue Date: Settlement date 29 male 57Female 35 F'
pattern = r"([0-9]+)\s?(male|female|m|f)\b"

def return_gender_dict(match):
    # match is an (age, gender) tuple produced by re.findall
    return {'age': match[0], 'gender': match[1][0].upper()}

matches = re.findall(pattern, test_str, flags=re.IGNORECASE)
result = [return_gender_dict(match) for match in matches]
print(result)
Outputting:
[{'age': '36', 'gender': 'M'}, {'age': '29', 'gender': 'M'}, {'age': '57', 'gender': 'F'}, {'age': '35', 'gender': 'F'}]
Try the following regex (it assumes a two-digit age):
(\d\d)(M|F|Male|Female|\sM|\sF|\sMale|\sFemale)
I have string like this
string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)
Output
('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
('General Info.', '[PA Number:', ' R180126187]')
('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')
Wanted output:
('Claim Status', 'Primary Status:Paidup to Rebilled')
('General Info.', 'PA Number:R180126187')
('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')
You may achieve what you need with a solution like this:
import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)
Output:
[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]
The ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex matches two consecutive lines, the second being optional (because of the (?:...)? optional non-capturing group). The first line is captured into Group 1 and the following one (which starts with [ and ends with ]) is captured into Group 2. (Note that \r?$ is necessary because in multiline mode $ only matches before a newline, not before a carriage return.) The Group 1 value is added to a temporary list, then the contents of Group 2 are split on ], [ and only the items that contain a : are added to the temporary list. If you are not sure about the amount of whitespace, you may split with re.split(r']\s*,\s*\[', m.group(2)) instead.
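To illustrate the re.split variant for irregular whitespace, here is a minimal sketch. The group2 string is a made-up stand-in for m.group(2) with deliberately uneven spacing, and its ICN and Id values are placeholders, not data from the question.

```python
import re

# Hypothetical Group 2 contents with inconsistent spacing around the commas
group2 = "Ins. Mode: Primary],[Corrected Claim Checked] ,  [ICN: 123],   [Id: 456"

# Split on "]" + optional whitespace + "," + optional whitespace + "[",
# then keep only the "key: value" items
items = [x.strip() for x in re.split(r']\s*,\s*\[', group2) if ':' in x]
print(items)
```

A plain str.split('], [') would leave this string in one piece at the first and second separators, which is why the regex-based split is the safer choice when the spacing is not guaranteed.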
You are getting three elements per result because your regular expression contains three capturing groups. Rewrite your regexp like this to combine the second and third match:
re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)
A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.
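The difference is easy to see side by side. The sketch below runs both patterns on a shortened copy of the string from the question; the variable names three_parts and two_parts are only for illustration.

```python
import re

string = "Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]"

# Three capturing groups: findall returns 3-tuples
three_parts = re.findall('(.*)\r\n(.*?:)(.*?])', string)

# Two inner groups made non-capturing and wrapped in one capturing group:
# findall returns 2-tuples
two_parts = re.findall('(.*)\r\n((?:.*?:)(?:.*?]))', string)

print(three_parts)
print(two_parts)
```

Note that the square brackets are still part of the second element; stripping them would need a further change to the pattern or a post-processing step.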
Using https://regex101.com/
My current regex: ^.*'(\d\s*.*)'*$
which doesn't seem to be working. What is the right pattern to use?
I want to be able to parse out 4 variables, namely
items, quantity, cost and total.
MY CODE:
import re
str = "xxxxxxxxxxxxxxxxxx"
match = re.match(r"^.*'(\d\s*.*)'*$",str)
print match.group(1)
The following regex matches each ingredient line and stores the wanted information in groups: r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$'
It defines 3 groups, separated from each other by whitespace:
^ marks the start of the string
(\d+) is the first group and looks for at least one digit
\s+ is the first separator between groups and looks for at least one whitespace character
([A-Za-z ]+) is the second group and looks for at least one letter or space
\s+ is the second separator between groups and looks for at least one whitespace character
(\d+(?:\.\d*)) is the third group and looks for at least one digit followed by a decimal point and any further digits (append ? to the inner group, giving (?:\.\d*)?, if the decimal part should be optional)
$ marks the end of the string
The regex that obtains the total is simple enough not to need an explanation.
Here is some test code using your test data. It should be a good starting point:
import re
TEST_DATA = ['Table: Waiter: kenny',
             '======================================',
             '1 SAUSAGE WRAPPED WITH B 10.00',
             '1 ESCARGOT WITH GARLIC H 12.00',
             '1 PAN SEARED FOIE GRAS 15.00',
             '1 SAUTE FIELD MUSHROOM W 9.00',
             '1 CRISPY CHICKEN WINGS 7.00',
             '1 ONION RINGS 6.00',
             '----------------------------------',
             'TOTAL 59.00',
             'CASH 59.00',
             'CHANGE 0.00',
             'Signature:__________________________',
             'Thank you & see you again soon!']
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')
ingredients = []
total = None
for string in TEST_DATA:
    match = INGREDIENT_RE.match(string)
    if match:
        ingredients.append(match.groups())
        continue
    match = TOTAL_RE.match(string)
    if match:
        total = match.groups()[0]
        break
print(ingredients)
print(total)
this prints:
[('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'), ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'), ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
59.00
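All captured groups are strings. If you need numbers, a possible follow-up step (a sketch, with the dictionary keys quantity, item and cost chosen only for this example) is to convert them while collecting the matches. It uses Decimal rather than float for the money values, and it assumes the last column is the price of the whole line.

```python
import re
from decimal import Decimal

INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')

# Three lines taken from the test data above
lines = ['1 CRISPY CHICKEN WINGS 7.00', '1 ONION RINGS 6.00', 'TOTAL 59.00']

items = []
for line in lines:
    match = INGREDIENT_RE.match(line)
    if match:  # the TOTAL line does not start with a digit, so it is skipped
        quantity, name, cost = match.groups()
        items.append({'quantity': int(quantity),
                      'item': name.strip(),
                      'cost': Decimal(cost)})

# Sum of the line prices of the parsed items
subtotal = sum(entry['cost'] for entry in items)
print(items)
print(subtotal)
```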
Edit on Python raw strings:
The r prefix before a Python string literal indicates that it is a raw string, which means that backslash escape sequences (like \t, \n, etc.) are not interpreted.
For example, in a standard string \t is a single tab character. In a raw string it is two characters: \ and t.
r'\t' is equivalent to '\\t'.
More details are in the Python documentation.
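The point about raw strings can be checked directly in the interpreter. This is just a small demonstration; the variable names are arbitrary.

```python
import re

tab = '\t'    # standard string: one tab character
raw = r'\t'   # raw string: a backslash followed by the letter t

print(len(tab), len(raw))
print(raw == '\\t')

# For regexes this means r'\d+' and '\\d+' are the same pattern
same_result = re.findall(r'\d+', 'a1 b22') == re.findall('\\d+', 'a1 b22')
print(same_result)
```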
I am trying to parse transaction letters from my (German) bank.
I'd like to extract all the numbers from the following string which turns out to be harder than I thought.
Option 2 does almost what I want. I now want to modify it to capture e.g. 80 as well.
My first try is option 1 which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?
Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.
# -*- coding: utf-8 -*-
import re
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
print(re.findall(r'\d+(,\d+)?', my_str))
print(re.findall(r'\d+,\d+', my_str))
print(re.findall(r'[-+]?\d*,\d+|\d+', my_str))
Output is
['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Option 1 is the most suitable of the three regexes, but it does not work as you expect because findall returns what is matched by the capture group (), not the complete match.
For example, the first three matches in your example will be the 18, 04 and 2013, and in each case the capture group will be unmatched so an empty string will be added to the results list.
The solution is to make the group non-capturing:
r'\d+(?:,\d+)?'
Option 2 falls short only in that it won't match numbers that don't contain a comma.
Option 3 isn't great because it will also match strings such as +,1.
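Here is a small sketch of the three points above, run on a single line of the statement so the results stay readable. The variable names are only illustrative.

```python
import re

line = "Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR"

# Capturing group: findall returns the group, which is empty for 18, 04 and 2013
capturing = re.findall(r'\d+(,\d+)?', line)

# Non-capturing group: findall returns the complete match
non_capturing = re.findall(r'\d+(?:,\d+)?', line)

# Option 3 accepts a sign followed directly by a comma
option3_odd_match = re.findall(r'[-+]?\d*,\d+|\d+', '+,1')

print(capturing)
print(non_capturing)
print(option3_odd_match)
```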
I'd like to extract all the numbers from the following string ...
If by "numbers" you mean both the currency amounts and the dates, I think that this will do what you want:
print(re.findall(r'[0-9][0-9,.]+', my_str))
Output:
['18.04.2013', '18.04.2013', '0,9800', '18.04.2013', '78,40', '20,67', '78,40', '57,73']
If by "numbers" you mean only the currency amounts, then use
print(re.findall(r'[0-9]+,[0-9]+', my_str))
Or perhaps better yet,
print(re.findall(r'[0-9]+,[0-9]+ EUR', my_str))
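One thing to be aware of with the last pattern is that each result includes the " EUR" suffix. If you only want the digits, a possible tweak (my own variation, not something the question asked for) is to put the amount in a capturing group so that findall returns just that part. The text below is a three-line excerpt of the statement.

```python
import re

text = "Bruttodividende : 78,40 EUR\nValuta : 18.04.2013\nEndbetrag : 57,73 EUR"

# Whole match: the unit is part of every result
with_unit = re.findall(r'[0-9]+,[0-9]+ EUR', text)

# One capturing group: findall returns only the group, i.e. the amount
amounts_only = re.findall(r'([0-9]+,[0-9]+) EUR', text)

print(with_unit)
print(amounts_only)
```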
Here is a solution that parses the statement and puts the result in a dictionary called bank_statement:
# -*- coding: utf-8 -*-
my_str = """
Dividendengutschrift für inländische Wertpapiere
Depotinhaber : ME
Extag : 18.04.2013 Bruttodividende
Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR
Valuta : 18.04.2013
Bruttodividende : 78,40 EUR
*Einbeh. Steuer : 20,67 EUR
Nettodividende : 78,40 EUR
Endbetrag : 57,73 EUR
"""
bank_statement = {}
for line in my_str.split('\n'):
    tokens = line.split()
    it = iter(tokens)
    category = ''
    for token in it:
        if token == ':':
            # Everything collected before the colon is the category name
            category = category.strip(' *')
            # The token right after the colon is the value
            bank_statement[category] = next(it)
            category = ''
        else:
            category += ' ' + token

# bank_statement now has all the values
print('\n'.join('{0:.<18} {1}'.format(k, v)
                for k, v in sorted(bank_statement.items())))
The output of this code:
Bruttodividende... 78,40
Depotinhaber...... ME
Einbeh. Steuer.... 20,67
Endbetrag......... 57,73
Extag............. 18.04.2013
Nettodividende.... 78,40
Valuta............ 18.04.2013
Zahlungstag....... 18.04.2013
pro Stück......... 0,9800
Discussion
The code scans the statement string line by line.
It then breaks each line into tokens.
It scans through the tokens looking for a colon. When one is found, the part before the colon is used as the category and the token after it as the value. bank_statement['Extag'], for example, has the value '18.04.2013'.
Please note that all the values are strings, not numbers, but it is trivial to convert them.
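For the conversion, something along these lines should work. It is only a sketch: parse_amount and parse_date are hypothetical helper names, the dictionary is a hand-copied subset of bank_statement, and amounts with a thousands separator (such as 1.234,56) are not handled.

```python
from datetime import datetime
from decimal import Decimal

def parse_amount(value):
    """Turn a German-formatted amount such as '78,40' into a Decimal."""
    return Decimal(value.replace(',', '.'))

def parse_date(value):
    """Turn a date such as '18.04.2013' (day.month.year) into a date object."""
    return datetime.strptime(value, '%d.%m.%Y').date()

# A subset of the values produced by the parsing code above
bank_statement = {'Extag': '18.04.2013',
                  'Bruttodividende': '78,40',
                  'Einbeh. Steuer': '20,67',
                  'Endbetrag': '57,73'}

gross = parse_amount(bank_statement['Bruttodividende'])
tax = parse_amount(bank_statement['Einbeh. Steuer'])
net = parse_amount(bank_statement['Endbetrag'])
ex_date = parse_date(bank_statement['Extag'])

print(gross - tax == net)   # gross dividend minus withheld tax is the final amount
print(ex_date.isoformat())
```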
Try this one:
re.findall(r'\d+(?:[\d,.]*\d)', my_str)
This regex requires that the match starts with a digit, continues with any mix of digits, commas and periods, and ends with a digit too. As a consequence it only matches numbers that are at least two characters long.
This question is relevant; the following (note the added ?:, which makes the group non-capturing)
print(re.findall(r'\d+(?:,\d+)?', my_str))
outputs
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']
Excluding the "dotted" numbers is a little more complicated; it takes a lookbehind, (?<!\d\.)\b, in front of the number and a lookahead, \b(?!\.\d), after it:
print(re.findall(r'(?<!\d\.)\b\d+(?:,\d+)?\b(?!\.\d)', my_str))
This outputs
['0,9800', '78,40', '20,67', '78,40', '57,73']
Option 2 doesn't match numbers like '18.04.2013' because you are matching '\d+,\d+' which means
digit (one or more) comma digit (one or more)
For parsing digits in your case I'll use
\s(\d+[^\s]+)
which translates to
space (get digit [one or more] get everything != space)
space = \s
get digit = \d
one or more = + (so it becomes \d+)
get everything != space = [^\s]
one or more = + (so it becomes [^\s]+)
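Applied to one line of the statement, the pattern behaves as sketched below. The second string is a made-up example, added to show that a lone digit is not matched because \d+ must be followed by at least one more non-space character.

```python
import re

line = "Zahlungstag : 18.04.2013 pro Stück : 0,9800 EUR"
tokens = re.findall(r'\s(\d+[^\s]+)', line)
print(tokens)

# Hypothetical input: the single digit 7 is followed by a space, so nothing matches
single_digit = re.findall(r'\s(\d+[^\s]+)', 'Stück 7 EUR')
print(single_digit)
```

So this approach returns the date as one token, unlike the comma-only pattern, but it needs whitespace before the number and at least two characters in it.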
My solution without a non-capturing group:
re.findall(r'(?<!\.)\b\d+[,\d]*\d\b(?!\.)', my_str)
(?<!\.) - checks that there is no . before the number (to avoid '2013' in the date 18.04.2013)
\b\d+ - the number must start at a word boundary
[,\d]* - allows any further commas or digits
\d\b - the number must end with a digit at a word boundary
(?!\.) - checks that there is no . after the number (to avoid '18' or '04' in the date 18.04.2013)
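To compare this pattern with the lookaround version that uses a non-capturing group, here is a small sketch. The sample string is invented for the comparison (the "80 Stück" and the trailing 5 are not in the original statement); it shows that both patterns skip the date, while the pattern above also skips single-digit numbers because it needs \d+ followed by another \d.

```python
import re

# Hypothetical sample: a date, a plain integer, a decimal amount and a single digit
sample = "18.04.2013 80 Stück 0,9800 EUR 5"

with_group = re.findall(r'(?<!\d\.)\b\d+(?:,\d+)?\b(?!\.\d)', sample)
without_group = re.findall(r'(?<!\.)\b\d+[,\d]*\d\b(?!\.)', sample)

print(with_group)
print(without_group)
```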