Python: Using regex to find the last pair of occurence - python

Attached is a text file that I want to parse. I want to select the text in the last combination of the words' occurrence:
(1) Item 7 Management Discussion Analysis
(2) Item 8 Financial Statements
I would usually use regex as follow:
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements",text, re.DOTALL)
You can see in the text file, the combination of Item 7 and Item 8 occurs often but if I find the last match (1) and last match (2), I increase by a lot the probability to grab the desired text.
The desired text in my text file starts with:
"'This Item 7, Management's Discussion and
Analysis of Financial Condition and Results of Operations, and other
parts of this Form 10-K contain forward-looking statements, within the
meaning of the Private Securities Litigation Reform Act of 1995, that
involve risks and..... "
and ends with:
"Item 8.
Financial Statements and Supplementary Data"
How can I adapt my regex code to grab this last pair between Item 7 and Item 8?
UPDATE:
I also try to parse this file using the same items.

This code has been rewritten. It now works with both the original data file (Output2.txt) and the newly added data file (Output2012.txt).
import re
discussions = []
for input_file_name in ['Output2.txt', 'Output2012.txt']:
with open(input_file_name) as f:
doc = f.read()
item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
discussion_text = r"[\S\s]*"
item8 = r"Item 8\.*\s*Financial Statements"
discussion_pattern = item7 + discussion_text + item8
results = re.findall(discussion_pattern, doc)
# Some input files have table of contents and others don't
# just keep the last match
discussion = results[len(results)-1]
discussions.append((input_file_name, discussion))
The discussions variable contains the results for each of the data files.
This is the original solution. It does not work for the new file but does show the use of named groups. I am not familiar with StackOverflow protocol here. Should I delete this old code?
By using longer match strings, the number of matches can be reduced to just 2 for both item 7
and item 8 - the table of contents and the actual section.
So search for the second occurence of item 7, and keep all text until item 8. This code uses
Python named groups.
import re
with open('Output2.txt') as f:
doc = f.read()
item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
item8 = r"Item 8\.*\s*Financial Statements"
discussion_pattern = re.compile(
r"(?P<item7>" + item7 + ")"
r"([\S\s]*)"
r"(?P<item7heading>" + item7 +")"
r"(?P<discussion>[\S\s]*)"
r"(?P<item8heading>" + item8 + ")"
)
match = re.search(discussion_pattern, doc)
discussion = match.group('discussion')

Use this pattern with s option
.*(Item 7.*?Item 8)
result at capturing group #1
Demo
. # Any character except line break
* # (zero or more)(greedy)
( # Capturing Group (1)
Item 7 # "Item 7"
. # Any character except line break
*? # (zero or more)(lazy)
Item 8 # "Item 8"
) # End of Capturing Group (1)
# " "

re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements(?!.*?(?:Item(?:(?!Item).)*7)|(?:Item(?:(?!Item).)*8))",text, re.DOTALL)
Try this.Added a lookahead .

Related

what is the fast way to match words in text?

i have a list of regex like :
regex_list = [".+rive.+",".+ll","[0-9]+ blue car.+"......] ## list of length 3000
what is the best method to match all this regex to my text
for example :
text : Hello, Owning 2 blue cars for a single driver
so in the output , i want to have a list of matched words :
matched_words = ["Hello","4 blue cars","driver"] ##Hello <==>.+llo
Alright, first of all, you will probably want to adjust your regex_list, because of now, matching those strings will give you the entire text back as match. This is because of .+, which states that there may follow any character any amount of time. What I have done here is the following:
import re
regex_list = [".rive.",".+ll.","[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"
# Returns all the spans of matched regex items in text
spans = [re.search(regex_item,text).span() for regex_item in regex_list]
# Sorts the spans on first occurence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
print(matching_texts)
I adjusted your regex_list slightly, so it does not match the entire text. Then, I retrieve all spans from the matches with the text. Additionally, I sort the spans on first occurence. Lastly, I retrieve the texts via the indexes of the spans and print those out. What you will get is the following
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
You could also try this which is multi threaded version of #Lexpj answer
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
# list of length 3000
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "
def test(text, regex):
# Returns all the spans of matched regex items in text
spans = [re.search(regex, text).span()]
# Sorts the spans on first occurence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
return matching_texts
with ThreadPoolExecutor(max_workers=10) as executor:
futures = {executor.submit(test, my_string, regex)
for regex in regex_list}
# as_completed() gives you the threads once finished
matched = set()
for f in as_completed(futures):
# Get the results
rs = f.result()
matched = matched.union(set(rs))
print(matched)
Looking at the desired result, your regexes are not correct. You don't want to match .+, but \w+, and also with the second regex, you'll want to match some letters after ll too.
The main idea is then to make one regex for all, by concatenating them with the | symbol:
import re
regex_list = [r"\w+rive\w+", r"\w+ll\w+", r"\d+ blue car\w+"]
regex = re.compile('|'.join(regex_list))
text = "Hello, Owning 2 blue cars for a single driver "
print(regex.findall(text)) # ["Hello","2 blue cars","driver"]
This still could give undesired effects when there is a part of your string that would match with more than one regex in the list. In that case the first will "win". So make sure that when multiple regexes could match the same text, they are ordered along their desired priority.

Split String based on multiple Regex matches

First of all, I checked these previous posts, and did not help me. 1 & 2 & 3
I have this string (or a similar case could be) that need to be handled with regex:
"Text Table 6-2: Management of children study and actions"
What I am supposed to do is detect the word Table and the word(s) before if existed
detect the numbers following and they can be in this format: 6 or 6-2 or 66-22 or 66-2
Finally the rest of the string (in this case: Management of children study and actions)
After doing so, the return value must be like this:
return 1 and 2 as one string, the rest as another string
e.g. returned value must look like this: Text Table 6-2, Management of children study and actions
Below is my code:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
print(parts_of_title)
print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])
The first requirement is returned true as should be but the second doesn't so, I changed the code and used compile but the regex functionality changed, the code is like this:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
print(parts_of_title)
Output:
True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
So based on this, how I can achieve this and stick to a clean and readable code? and why does using compile change the matching?
The matching changes because:
In the first part, you call .group().split() where .group() returns the full match which is a string.
In the second part, you call re.compile("...").split() where re.compile returns a regular expression object.
In the pattern, this part will match only a single word [a-zA-Z0-9]+[ ], and if this part should be in a capture group [0-9]([-][0-9]+)? the first (single) digit is currently not part of the capture group.
You could write the pattern writing 4 capture groups:
^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)
See a regex demo.
import re
pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
print(m.groups())
Output
('Text ', 'Table', '6-2', 'Management of children study and actions')
If you want point 1 and 2 as one string, then you can use 2 capture groups instead.
^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)
Regex demo
The output will be
('Text Table 6-2', 'Management of children study and actions')
you have already had answers but I wanted to try your problem to train myself so I give you all the same what I found if you are interested:
((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)
And here is the link to my tests: https://regex101.com/r/7VpPM2/1

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional but sometimes there would random be periods (.) on certain lines of the output. At first I thought the program was just buggy but then I realized that the regex I'm using to match the books title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the books title and author
titleRegex = re.compile('(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: *I like apples because they are green (they are sometimes red as well). *
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile('(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
newString = re.sub(regexList[x], ' ', string)
string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
for item in finalText:
f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
r"\n" +
r"(?P<quote>.*?)\n", flags=re.MULTILINE)
for m in quoteInfoRegex.finditer(data):
print(m.groupdict())
This will pull out each line of the text, and parse it, knowing that the book title is the first line after the equals, and the quote itself is below that.

text selection using greedy and reluctant

I am trying to extract text between markers. But these markers are word tokens and repeat frequently. input is EDGAR txt files.
My markers are ITEM some number ie. ITEM 1 and ITEM2.
my regex is
MDA_regex = 'item[^a-zA-Z\n]*\d\s*\.\s*management\'s discussion and analysis.*?item[^a-zA-Z\n]*\d\s*'
It is working fine but it fails if keyword item\d... occur between ITEM 1 AND ITEM 2. If i use .* it goes to other unwanted marker as the reports may contains other item\d.... If use .*? it will stuck at first occurance of item.
I cannot hardcode 1 and 2 because different reports can have different position/header ie. item 7 to item 8 for my required text. I am using python
for fileName in os.listdir(path):
fileName = os.path.join(path, fileName)
if os.path.isfile(fileName):
print("opening new file " + fileName)
with open(fileName, 'r', encoding='utf-8', errors='replace') as in_file:
content = in_file.read().replace('\n',' ')
mda_list=re.findall(MDA_regex,content, re.IGNORECASE|re.DOTALL)
print(mda_list)
print(len(mda_list))
My input is like
Management's Discussion and Analysis of Financial Condition and
Results of Operations............................................. 22
Item 3.
ITEM 1 . MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION
AND RESULTS OF OPERATIONS.
FORWARD-LOOKING STATEMENTS
This report contains forward-looking statements. Additional
written or oral forward-looking statements may be made by AMERCO from
time to time in filings with the Securities and Exchange Commission or
otherwise. Management believes such forward-looking statements are
within the meaning of the safe-harbor provisions. Such item 1 statements may
include, but not be limited to, projections of revenues, income or
<PAGE> 33
ITEM 2. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK
DFDFDLF;DF SDLKD dlfldfkdffd;fl;l sdsl; dklkkdsmm,sd item 4
DDFLDFL dlkdsldkf dldfd;lf;f
Also 'item' can not be caps because some reports have caps and some don,t.
Can someone suggest how to handle this. Should i use multiple regex in if-else condition to check each condition.

Regex sub only removes certain expressions

I'm running a program which creates product labels based on csv data. The function which I am struggling with takes a data structure which consists of a number combination(width of a wooden plank) and a string (name of product). Possible combinations I search for are as follows:
5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF
My function needs to take in the data, split the width from the name and return them both as separate variables as follows:
desc = row[1]
if filter.lower() in desc.lower():
size = re.search(r'(\d{1})(\-*)(\d{0,1})(\/*)(\d{0,2})(\+*)(\d{0,1})(\-*)(\d{0,1})(\/*)(\d{0,2})', desc)
if size:
# remove size from description
desc = re.sub(size.group(), '', desc)
size = size.group() # extract match from obj
else:
size = "None"
The function does as intended with the first two samples, however when it comes across the last product, it recognizes the size but does not remove it from description. Screen shot below shows the output after I print (size + \n + desc)
Is there an issue with my re expression or elsewhere?
Thanks
re.sub() expects its first argument to be a regex. It works for the first two because they don't contain any characters that have special meaning in the context, however the third contains +, which is special.
There's not actually any reason to use regex there... regular string replacement should work:
desc = desc.replace(size.group(), '')
Why replace and not simply match what you need?
import re
text = """5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF""".split('\n')
print(text)
for t in text:
pattern = r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
m = re.search(pattern,t)
print(m.group('size'))
print(m.group('species'))
Output:
5
MAPLE PEPPER-ANTIQUE
3-1/4
MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4
MAPLE TIMBERWOLF
Regex:
r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
2 named groups, between them 0-n spaces.
1st group only 0123456789-+/ allowed
2nd group any but 0123456789 allowed

Categories

Resources