Reading financial statements using REGEX - python

I'm working on a project where I have to read scanned images of financial statements. I used Tesseract 4 to convert the image to text output, which looks like this (here is a snippet):
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
I would like to break the above into a list of three entries, where the first entry is the text, then the second and third entries would be the numbers. For example the first row would look something like this:
[[REVENUE], [9,000,000], [9,000,000]]
I came across this Stack Overflow post where someone uses re.match() together with the .groups() method to find the pattern: How to split strings into text and number?
I'm just being introduced to regex and I'm struggling to properly understand the syntax and documentation. I'm trying to use a cheat sheet for now, but I'm having a tough time figuring out how to go about this, please help.

I wrote this regex based on your first expected output, but I am not sure what output you want for your third line.
([A-Za-z ]+)(?=\d|\S) matches the name until a digit or symbol is found.
.*? skips the text we do not care about.
([\d,]+)\s([\d,]+|(?=-\n|-$)) matches one or two groups of numbers; if there is only one group of numbers, it must be followed by a - at the end of the line or the end of the text.
Test code (edited):
import re
regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"
text = """
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
Business taxes 999 -
"""
print(re.findall(regex,text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]

Regexes are overkill for this problem as you've stated it.
text.split() and a join of the items before the last two is better suited to this.
lines = ["REVENUE 9,000,000 900,000",
         "COST OF SALES 900,000 900,000",
         "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"]
out = []
for line in lines:
    parts = line.split()
    if len(parts) < 3:
        # A valid row needs at least a label and two values
        raise ValueError("unexpected line: %r" % line)
    if len(parts) == 3:
        out.append(parts)
    else:
        # Everything before the last two fields is the label
        out.append([' '.join(parts[0:len(parts)-2]), parts[-2], parts[-1]])
out will contain
[['REVENUE', '9,000,000', '900,000'],
['COST OF SALES', '900,000', '900,000'],
['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]
If the label text needs further extraction, you could use regexes, or you could simply look at the items in parts[0:len(parts)-2] and process them based on the words and numbers there.
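A slightly shorter variant of the same idea, as a sketch only: str.rsplit with a maxsplit of 2 peels the two value columns off the right. It assumes every row ends with exactly two whitespace-separated value fields; the helper names split_row and to_int are made up for illustration, and the "Business taxes" row is borrowed from the regex answer above.

```python
def split_row(line):
    """Split an OCR'd row into [label, value, value] by peeling two fields off the right."""
    parts = line.rsplit(None, 2)  # split on whitespace, at most twice, from the right
    if len(parts) < 3:
        raise ValueError(f"expected a label and two values: {line!r}")
    return parts


def to_int(value):
    """Convert '9,000,000' to 9000000; a lone '-' (no value) becomes None."""
    return None if value == "-" else int(value.replace(",", ""))


rows = [split_row(line) for line in (
    "REVENUE 9,000,000 900,000",
    "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000",
    "Business taxes 999 -",
)]
print(rows)
```

Converting with to_int only after the split keeps the thousands separators from interfering with the column detection.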

To detect the string
rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"
and extract the values
("REVENUE", "9,000,000", "9,000,000")
you would do
import re
x = re.match(r"\[\[([A-Z]+)\], \[([0-9,]+)\], \[([0-9,]+)\]\]", rev_str)
x.groups()
# ('REVENUE', '9,000,000', '9,000,000')
Let's unpack this big ol' string.
Square brackets signify a range of characters. For example, [A-Z] means to look for all letters from A to Z, whereas [0-9,] means to look for the digits 0 through 9, as well as the character ,. The - here is an operator used inside square brackets to denote a range of characters that we want.
The + operator means to look for at least one occurrence of whatever immediately precedes it. For example, the expression [A-Z]+ means to look for at least one occurrence of any of the letters A through Z. You can also use the * operator instead, to look for zero or more occurrences of whatever precedes it.
The round brackets (i.e. parentheses) signify a group to be extracted from the regex. Whenever that pattern is matched, whatever is inside any expression in parentheses will be extracted and returned as a group. For example, ([A-Z]+) means to look for at least one occurrence of any of the letters A through Z, and then save whatever that turns out to be. We access this by doing x.groups() after assigning the result of the regex match to a variable x.
Otherwise, it's straightforward - accommodating for the pattern [[TEXT], [NUMBER], [NUMBER]]. The square brackets are escaped with the \ character, because we want to interpret them literally, rather than as a range of characters.
Overall, the re.match() function tries to match the given pattern at the start of rev_str (it does not scan the rest of the string; re.search() does that), keeps track of the groups within that match, and returns those groups when you call x.groups().
This is a fairly simple example, but you've gotta start somewhere, right? You should be able to use this as a starting point for making a more complicated regex to process more of your text.
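As a possible next step, here is a small sketch that applies the same building blocks to one of the raw OCR lines from the question. The pattern and the group names (label, current, prior) are illustrative assumptions about what the columns mean, not something the question states.

```python
import re

# Label: lazily take everything up to the last two number-like columns.
ROW = re.compile(r"(?P<label>.+?)\s+(?P<current>[\d,]+)\s+(?P<prior>[\d,]+)\s*$")

line = "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"
m = ROW.match(line)
print(m.group("label"), m.group("current"), m.group("prior"))

# re.match only tries the start of the string; re.search scans forward.
print(re.match(r"[\d,]+", line))           # None: the line starts with a letter
print(re.search(r"[\d,]+", line).group())  # first run of digits/commas found
```

Named groups ((?P<name>...)) work like the numbered groups described above, but can be read back by name instead of position.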

Related

Python: Use Regex to Match Phone Number And Print Tuple (w/Formatting Constraints)

I want to write code that can parse American phone numbers (ie. "(664)298-4397") . Below are the constraints:
allow leading and trailing white spaces
allow white spaces that appear between area code and local numbers
no white spaces in area code or the seven digit number XXX-XXXX
Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)
I have two sets of questions.
Question 1:
Below are inputs my code should accept and print the tuple for:
'(664) 298-4397', '(664)298-4397', ' (664) 298-4397'
Below is the code I tried:
regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print (f' groups are: {regex_parse1.groups()} \n')
regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print (f' groups are: {regex_parse2.groups()} \n')
regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' (664) 298-4397')
print (f' groups are: {regex_parse3.groups()}')
The string input for all three are valid and should return the tuple:
('664', '298', '4397')
But instead I'm getting the output below for all three:
groups are: ('', '', '4397')
What am I doing wrong?
Question 2:
The following two chunks of code should raise a "'NoneType' object has no attribute 'groups'" error because the input phone number strings violate the constraints. Instead, I get output for both.
regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print (f' groups are: {regex_parse4.groups()}')
regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print (f' groups are: {regex_parse5.groups()}')
Expected output: should be an error, but I get this instead:
groups are: ('', '', '2121')
What is wrong with my regex code?
In general, your regex overuses the asterisk *. Details as follows:
You have 3 capturing groups:
([\s]*[(]*[0-9]*[)]*[\s]*)
([\s]*[0-9]*)
([0-9]*[\s]*)
You use an asterisk on every single item, including the opening and closing parentheses. Almost everything in your regex is quantified with an asterisk, so the capturing groups can also match empty strings. That's why your first and second capturing groups return empty strings. The only item without an asterisk is the hyphen - just before the third capturing group, which is why your regex does capture the third group, as in 4397 and 2121.
To solve your problem, you have to use the asterisk only where it is needed.
In fact, your regex still has plenty of room for improvement. For example, it currently matches digit runs of any length (instead of chunks of 3 or 4 digits). It also allows an area code that is not enclosed in parentheses (because of the asterisks on the parenthesis characters).
For this kind of common regex, you don't need to reinvent the wheel. You can refer to ready-made regexes that are easy to find on the Internet, for example in this post. Although the post uses JavaScript instead of Python, the regex is much the same.
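To make "use the asterisk only where it is needed" concrete, here is one possible pattern, offered as a sketch rather than the definitive answer. It assumes the area code must always be in parentheses and that only whitespace may be optional; PHONE and parse_phone are illustrative names.

```python
import re

# Optional outer whitespace, "(XXX)", optional whitespace, then "XXX-XXXX".
PHONE = re.compile(r"\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*")


def parse_phone(text):
    """Return (area_code, first_three, last_four), or None if the text is invalid."""
    m = PHONE.fullmatch(text)
    return m.groups() if m else None


for s in ["(664) 298-4397", "(664)298-4397", "  (664) 298-4397 ",
          "(404)555 -1212", " ( 404)121-2121"]:
    print(repr(s), parse_phone(s))
```

Only the three \s* pieces are optional; the digit counts are fixed with {3} and {4}, and the capturing groups contain digits only, so the parentheses are matched but not returned.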
Try:
regex_parse4 = re.match(r'\s*[(]([0-9]{3})[)]\s*([0-9]{3})-([0-9]{4})\s*$', number)
This assumes a 3-digit area code in parentheses, followed by XXX-XXXX. The parentheses are matched but kept outside the capturing group.
re.match returns None when there is no match.
If above does not work, here is a helpful regex tool:
https://regex101.com
Edit:
Another suggestion is to clean the data before applying a new regex. This handles abnormal spacing and gets rid of the parentheses and the '-'.
clean_number = re.sub("[^0-9]", "", original_number)
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
print(f'groups are: {regex_parse.groups()}')
# groups are: ('xxx', 'xxx', 'xxxx')

Regex to split phrases separated into columns by many whitespaces

I'm hoping to get some regex assistance. I've got lines of columnar text that I'd like to split with regexes. Each column can be phrases of arbitrary characters, separated by a whitespace or maybe even two. Columns are separated by a larger number of whitespaces, perhaps at least 4.
Ultimately, I need to match a date if it's in the second column.
Here's an example. I need the date in this column to be the group important_date
Rain in Spain 11/01/2000 90 Days
important_date should not match the date in this next line:
Another line of text 10/15/1990
# EXAMPLE:
import re
regex = r"(.*)\s(?P<important_date>\d{1,2}\/\d{1,2}\/\d{4}).*"
match_this = " Rain in Spain 11/01/2000 90 Days"
not_this = " Another line of text 10/15/1990"
print(f"Finding this date is good:{re.search(regex, match_this).group('important_date')}" )
print(f"But this one should throw an error:{re.search(regex,not_this).group('important_date')}")
I'm also comparing these regexes against lots of other lines of text with various structures, so this is why I don't want to just split on a string of " ". To know I've got the important_date, I need to know that the whole line looks like: one-column, second column is date, maybe another column after the date too.
Doing this with a single regex would also just fit much more easily into the rest of the application. I'm worried that line.split(" ") and checking the resulting list would interfere with other checks going on in this app.
I have not been able to figure out how to write the first part of the regex that captures words with no-more-than-2 spaces between them. Can I use lookaheads for this somehow?
Thank you!
Try this: (?m)^\s*(\w+\s)+\s+(?P<important_date>\d\d/\d\d/\d\d\d\d).*$ (https://regex101.com/r/PnIU3e/3).
I assume that the first column consists of words separated by single spaces, and is separated from the second column by more than one space.
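For the "no more than 2 spaces inside a column" requirement, no lookahead is needed; a bounded quantifier on the spaces is enough. The sketch below assumes columns are separated by 3 or more plain spaces (no tabs), and the two sample strings are hypothetical stand-ins with explicit wide gaps, with the second one placing the date in the third column.

```python
import re

DATE_IN_SECOND_COLUMN = re.compile(
    r"^ *"                       # optional leading spaces
    r"\S+(?: {1,2}\S+)*"         # column 1: tokens separated by 1-2 spaces
    r" {3,}"                     # column gap: 3 or more spaces
    r"(?P<important_date>\d{1,2}/\d{1,2}/\d{4})"
    r"(?: {3,}.*)? *$"           # optionally, further columns
)

match_this = "  Rain in Spain     11/01/2000     90 Days"
not_this = "  Another line     of text     10/15/1990"  # date is in column 3

m = DATE_IN_SECOND_COLUMN.search(match_this)
print(m.group("important_date"))
print(DATE_IN_SECOND_COLUMN.search(not_this))
```

Because the first column can never swallow a gap of 3 or more spaces, the date only matches when it directly follows the first gap.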
You can split on fields of 2 or more spaces and only use the data if it is the second field:
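for x in (match_this, not_this):
    te = re.split(r'[ \t]{2,}', x)
    # te[0] is '' when the line starts with 2+ spaces or tabs, so te[2] is the second column
    if len(te) > 2 and re.match(r'\d{1,2}/\d{1,2}/\d{4}', te[2]):
        # you have an important date
        print(te[2])
    else:
        # you don't
        print('no match')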

Match pattern 1 and/or pattern 2

I have multiple file names that are either a movie title or an episode of a TV show. For the movie titles I want to match the year the movie came out, and for the episodes I want to match the season and episode number in the format S00E00. However, I can't know in advance which of the two the string contains; sometimes it contains both the season and episode and the year. I also don't know what comes first in the string, the year or the season and episode.
I tried the following pattern: (\d{4})|S(\d\d)E(\d\d), but that only returns a match for whichever came first. For the string 2012.S01E02, it returns 2012, and for the string S01E02.2012 it returns S01E02. The rest of the capture groups are None (I'm using Python 3.5).
I have a solution which uses two separate matches and if-statements and generally looks ugly. Is there a way to have one regex pattern that returns a list (or tuple) which contains (year, season, episode), regardless of what comes first in the string?
You could use the following regular expression:
.*?(\d{4}).*?(S\d{2}E\d{2}).*?|.*?(S\d{2}E\d{2}).*?(\d{4}).*?|.*?(S\d{2}E\d{2}).*?|.*?(\d{4}).*?
.*?(\d{4}).*?(S\d{2}E\d{2}).*?: This first tries to match the year followed by the episode number.
.*?(S\d{2}E\d{2}).*?(\d{4}).*?: This matches the reverse order.
.*?(S\d{2}E\d{2}).*?: This matches the episode number alone.
.*?(\d{4}).*?: This matches the year alone.
Because the alternatives are tried in this order, you get both the year and the episode number whenever both are present.
var regex = /.*?(\d{4}).*?(S\d{2}E\d{2}).*?|.*?(S\d{2}E\d{2}).*?(\d{4}).*?|.*?(S\d{2}E\d{2}).*?|.*?(\d{4}).*?/;
var matches = "test|S02E12|2012_test".match(regex);
matches = matches.filter(function(item) {
    return item !== undefined;
}).splice(1).sort();
console.log(matches);
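Since the question is about Python, here is a rough Python sketch of a different single-pattern technique: two lookaheads, each with an empty alternative so that it succeeds whether or not its item is present. This is not a translation of the JavaScript above; the group names and the parse_name helper are illustrative, and it assumes a year is any stand-alone run of exactly four digits.

```python
import re

# Each lookahead either finds its item anywhere in the name or, via the empty
# alternative after "|", succeeds without capturing anything.
PATTERN = re.compile(
    r"(?=.*?(?<!\d)(?P<year>\d{4})(?!\d)|)"
    r"(?=.*?S(?P<season>\d\d)E(?P<episode>\d\d)|)"
)


def parse_name(name):
    """Return (year, season, episode); any part that is missing is None."""
    return PATTERN.match(name).group("year", "season", "episode")


for name in ["2012.S01E02", "S01E02.2012", "Some.Movie.1999", "Show.S03E10"]:
    print(name, parse_name(name))
```

The groups always come back in the same (year, season, episode) order, so no filtering or sorting of the result is needed.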

Boolean search text file in Python

I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:
import re
sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
print(len(sections))
So now, articles are represented by the expression sections
The next thing I want to do, is to subgroup the articles in 2 groups. Those articles containing the words: economy OR economic AND uncertainty OR uncertain AND tax OR policy, identify them with the number 1.
Whereas those articles containing the following words: economy OR economic AND uncertain OR uncertainty AND regulation OR spending, identify them with the number 2. This is what I have tried so far:
for i in range(len(sections)):
    group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
    group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])
Nevertheless, it does not seem to work. Any ideas why?
It's a bit wordy, but you can get away without using regular expressions here, for example:
# Take a lowercase copy for comparisons
s = sections[i].lower()
if (('economic' in s or 'economy' in s) and
        ('uncertainty' in s or 'uncertain' in s) and
        ('tax' in s or 'policy' in s)):
    do_stuff()
It is possible to write this as a single regular expression, but it is a bit tricky. For each and you'd use a zero-width lookahead assertion (?= ), and for each or you'd use a branch. We'd also have to use \b for word boundaries, and re.S so that . can cross line breaks in a multi-line article. Because every lookahead starts with .*, re.match at the start of the text is enough and re.search is not needed.
belongs_to_group1 = bool(re.match(
    r'(?=.*\b(?:economic|economy)\b)'
    r'(?=.*\b(?:uncertain|uncertainty)\b)'
    r'(?=.*\b(?:tax|policy)\b)', text, re.I | re.S))
This is not very readable, though.
A more fruitful approach would be to find all words and put them into a set
words = set(re.findall(r'\w+', text.lower()))
belongs_to_group1 = (('uncertainty' in words or 'uncertain' in words)
and ('economic' in words or 'economy' in words)
and ('tax' in words or 'policy' in words))
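Extending the word-set idea to both groups from the question might look like the sketch below. The helper names has_any and classify are made up, the sample sentences are invented test inputs, and it assumes an article may belong to group 1, group 2, both, or neither.

```python
import re


def has_any(words, *candidates):
    """True if at least one candidate word is in the set of words."""
    return any(c in words for c in candidates)


def classify(text):
    """Return the group numbers (1 and/or 2) whose criteria the article meets."""
    words = set(re.findall(r"\w+", text.lower()))
    base = (has_any(words, "economy", "economic")
            and has_any(words, "uncertainty", "uncertain"))
    groups = []
    if base and has_any(words, "tax", "policy"):
        groups.append(1)
    if base and has_any(words, "regulation", "spending"):
        groups.append(2)
    return groups


print(classify("Economic uncertainty rose after the tax announcement."))
print(classify("An uncertain economy; spending and regulation debated."))
print(classify("Weather report for Aberdeen."))
```

Calling classify(sections[i]) for each article would then give the group labels per section index.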
You can use re.search to find those words. Then you can use if statements and python's and and or statements for the logic, and then store group one and two as two lists with the section index number as a value.
One thing you might want to note is that your logic may need brackets.
By
economy OR economic AND uncertainty OR uncertain AND tax OR policy
I assume you mean
(economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy)
which is different to (for example)
economy OR (economic AND uncertainty) OR (uncertain AND tax) OR policy
EDIT1:
Python will evaluate your statement without brackets from left to right, i.e.:
( ( ( ( (economy OR economic) AND uncertainty) OR uncertain) AND tax) OR policy)
Which I imagine is not what you want (e.g. the above evaluates true if it includes the word policy but none of the others)
EDIT2:
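As pointed out in the comments, EDIT1 is incorrect: and binds more tightly than or, so Python does not evaluate the expression strictly left to right. You still need brackets to get the first grouping above; without them you get the second grouping instead (and the left-to-right grouping shown in EDIT1 never applies).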
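A quick way to check how Python groups the unbracketed expression is to evaluate it with only policy set to True. The variable names below simply stand for "this word occurs in the article".

```python
# `and` binds more tightly than `or`, so the first two expressions are the same.
economy, economic, uncertainty, uncertain, tax, policy = (
    False, False, False, False, False, True)

unbracketed = economy or economic and uncertainty or uncertain and tax or policy
as_parsed = economy or (economic and uncertainty) or (uncertain and tax) or policy
intended = (economy or economic) and (uncertainty or uncertain) and (tax or policy)

print(unbracketed, as_parsed, intended)  # policy alone satisfies only the first two
```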

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
Wrapping everything between 2, and ,X in () captures the whole run of numbers. The inner group is made non-capturing with (?: ), so findall returns only the outer group instead of the last repetition of the inner one.
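Building on that, a sketch of how the shortest match could be picked once findall returns whole sequences. The three-line sample string is made up for illustration, the pattern uses \d+ so that multi-digit numbers such as 11 are explicit, and it assumes at least one number sits between the prefix and the X.

```python
import re

s = """O,2,9,6,7,11,8,X
O,2,5,X
O,4,6,9,3,1,7,5,O"""

# \d+ allows multi-digit numbers; (?:...) groups without capturing, so the
# single outer group returns the whole run of numbers.
pattern = re.compile(r"O,2,((?:\d+,)*\d+),X")
matches = pattern.findall(s)
print(matches)

# Shortest sequence = fewest numbers, i.e. fewest commas.
shortest = min(matches, key=lambda m: m.count(","))
print(shortest)
```

Note that min raises ValueError on an empty list, so real code would need to handle the case where nothing matches.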
You don't have to use a regex:
split the string into a list on commas
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each of items[2:-1] is a number (str.isdigit)
That's all.
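Those steps could be sketched as follows. The function name moves_after_prefix and its default arguments are illustrative, chosen to match the O,2,...,X example from the question.

```python
def moves_after_prefix(line, first="O", second="2", last="X"):
    """Return the numbers between the 'O,2' prefix and the final 'X', or None."""
    items = line.strip().split(",")
    if len(items) < 3 or items[0] != first or items[1] != second or items[-1] != last:
        return None
    middle = items[2:-1]
    if not all(item.isdigit() for item in middle):
        return None
    return middle


print(moves_after_prefix("O,2,9,6,7,11,8,X"))
print(moves_after_prefix("O,4,1,8,6,7,9,5,3,X"))
```

Since the result is already a list, len() gives the number of moves directly when looking for the shortest sequence.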
