python regex negative lookahead method - python

I'm extracting firm names from text data (10-K filings).
I first tried the NLTK StanfordTagger and extracted every word tagged as an organization. However, it quite often failed to recall all of the firm names, and because I was applying the tagger to every related sentence, it took a very long time.
So now I'm trying to extract all words that start with a capital letter (or that consist entirely of capital letters).
I found the regex below helpful.
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+
However, it cannot distinguish the name of a segment from the name of a firm.
For example,
sentence :
The Company's customers include, among others, Conner Peripherals Inc.("Conner"),
Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.
I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, Applieds, but not 'Silicon Systems', since it is the name of a segment.
So, I tried using
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)
However, it still extracts 'Silicon Systems'.
Could you help me solve this problem?
(Or do you have any idea how to extract only the firm names from the text data?)
Thanks a lot!!!

You need to capture each run of consecutive capitalized words as a whole. Wrap the pattern for a single capitalized word in a non-capturing group (?:...), repeat it, and put a capturing group around the repetition so that re.findall returns the whole run:
>>> re.findall("((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']

The NLTK approach, or any machine learning, seems to be a better approach here. I can only explain what the difficulty and current issue with the regex approach are.
The problem is that the matches expected can contain space separated phrases, and you want to avoid matching a certain phrase ending with segment. Even if you correct the negative lookahead as (?!\s*[Ss]egment), and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match.
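To see both effects concretely, here is a minimal sketch that runs the question's pattern and the corrected linear pattern on the segment sentence. The shortened sentence and the variable names are only illustrative assumptions, not part of the original posts.

```python
import re

# Shortened version of the example sentence (illustrative only).
sentence = (
    "The largest proportion of Applieds consolidated net sales has been "
    "derived from sales of manufacturing equipment in the Silicon Systems "
    "segment to the global semiconductor industry."
)

# Pattern from the question: the optional \s? lets the engine backtrack,
# give the trailing space back, and then the lookahead sees " segment"
# (with a leading space), so the negative lookahead succeeds.
question_rx = re.compile(r"(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)")
question_matches = question_rx.findall(sentence)

# Linear pattern with the corrected lookahead: the engine backtracks to the
# shorter entity "Silicon", which is followed by " Systems", not " segment".
linear_rx = re.compile(
    r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment)"
)
linear_matches = linear_rx.findall(sentence)

print(question_matches)
print(linear_matches)
```

The first list still contains 'Silicon Systems', and the second contains the fragment 'Silicon', which is why a lookahead alone is not enough here.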
What you might try to do is to match all these entities and discard after matching, and only keep those entities in other contexts by capturing them into Group 1.
See the sample regex for this:
\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)
Since it is unwieldy, you should think of building it from blocks, dynamically:
import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
# Drop the empty strings produced when the "segment" alternative matched
matches = list(filter(None, re.findall(rx, s)))
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']
So,
\b - matches a word boundary
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed by letters/digits/-
(?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
    \s+ - 1+ whitespaces
    [A-Z][a-zA-Z0-9-]* - an uppercase letter followed by letters/digits/-
\b - trailing word boundary
\.? - an optional .
Then, this block is used to build
{0}\s+[sS]egment\b - the block we defined before followed by
    \s+ - 1+ whitespaces
    [sS]egment\b - either segment or Segment as a whole word
| - or
({0}) - Group 1 (what re.findall actually returns): the block we defined before.
filter(None, ...) removes the empty strings that re.findall returns whenever the first (segment) alternative matched. In Python 2.x it returns a list directly; in Python 3.x it returns an iterator, so wrap it as list(filter(None, re.findall(rx, s))).

Related

How to find a specific, pre-defined word surrounded by any word(s) starting with a capital letter(s)?

I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*The above regex is very slow with the str.findall function; I guess there must be a better solution.
**I used https://regex101.com for testing and then ran it in Jupyter (Python 3).
You can use 2 capture groups instead, and match a single word starting with a capital A-Z on the left or on the right.
[^\S\r\n] matches a whitespace char other than a newline, whereas \s can also match a newline:
\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*
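As a rough check of what this pattern returns, the sketch below runs it over the sample phrases from the question plus one line that should not match. The extra line and the variable names are assumptions made for illustration.

```python
import re

# Pattern from the answer above; [^\S\r\n] is "whitespace except newlines".
pattern = re.compile(
    r"\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*"
)

# Sample phrases from the question, plus one line without capitalized neighbours.
text = "\n".join([
    "Europe National Longitudinal Study",
    "Longitudinal Study",
    "Study Initiative",
    "Longitudinal Study Initiative",
    "a Study of the test",
])

# group(0) is the full match; groups 1 and 2 hold only the keyword itself.
found = [m.group(0) for m in pattern.finditer(text)]
print(found)
```

Note that this pattern only takes one capitalized neighbour, so the first and fourth phrases both come back as 'Longitudinal Study' rather than the full phrase.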
Ok, this is possibly way out of the actual scope but you could use the newer regex module with subroutines:
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)
See a demo on regex101.com (and mind the modifiers!).
In actual code, this could be:
import regex as re
junk = r"""
I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
"""
pattern = re.compile(r'''
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)''', re.VERBOSE)
for match in pattern.finditer(junk):
    print(match.group(0))
And would yield
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})
Test
I still have to test it further to check whether it works for all of the scenarios that occur in real-world data. I might need to adjust the '5' in the expression to a lower or higher number to optimize performance, though. I have already tested it on some real datasets and the results have been promising so far. It is fast.
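A small sketch of how this bounded pattern behaves on a few phrases follows. The phrases are made up for illustration, and the outer capture group is dropped because the full match is read with group(0).

```python
import re

# Pattern from the post above, without the outer capture group.
# \s also matches newlines, so it is applied to one phrase at a time here.
pattern = re.compile(r"(?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5}")

# Hypothetical test phrases.
phrases = [
    "the Europe National Longitudinal Study began in May",
    "a Longitudinal Study Initiative was funded",
    "a Study of sleep",
]

# strip() removes the trailing whitespace swallowed by \s*.
found = [pattern.search(p).group(0).strip() for p in phrases]
print(found)
```

Because both quantifiers allow zero repetitions, a bare 'Study' with no capitalized neighbour also matches (third phrase), so such results may need to be filtered out afterwards.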

Python regex A|B|C matches C even though B should match

I've been sitting on this problem for several hours now and I really don't know anymore...
Essentially, I have an A|B|C - type separated regex and for whatever reason C matches over B, even though the individual regexes should be tested from left-to-right and stopped in a non-greedy fashion (i.e. once a match is found, the other regex' are not tested anymore).
This is my code:
import re
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"\
+ expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want the regex to find the "expansion" string. In my dataset, the text sometimes has the expansion string slightly edited, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it if it comes after any non-word character (i.e. parentheses or, like in the example above, a whole lot of stuff because the space was omitted), and finally I go full wildcard and search for the beginning and ending of the string with wildcards in between.
Either way, with the example above I would expect to get the following output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.
What gives?
The regex looks like this:
American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association
The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.
You can omit the second alternative: wherever it could match, the first alternative (the same text without the lookbehind) has already matched, and wherever the first alternative fails, the second fails too.
As the pattern matches from left to right and encounters the first occurrence with American, the first and the second alternatives can not match American College of Cardiology.
Then the third alternation can match it, and due to the .*? it can stretch until the first occurrence of Association.
What you might do is for example exclude possible characters to match using a negated character class:
\bAmerican\b[^/,.]*\bAssociation\b
Or you might use a tempered greedy token approach to not allow specific words between the first and last part:
\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
So re.findall(r"(?<=\W)"+ expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). The third alternative of your regex, however, matches "American [whatever random stuff here] Association". Because the engine tries every alternative at the leftmost position before moving further right, it matches "American College of Cardiology (ACC)/American Heart Association", starting at the first "American", before it ever reaches the inner string "American Heart Association". E.g. if you deleted the first "American" in your string, you would get the match you are looking for with your regex.
You need to be more restrictive with your regex to rule out situations like these.
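The behaviour discussed in this thread can be reproduced with a short sketch. The text below is a shortened version of the question's string, and the variable names are illustrative assumptions.

```python
import re

# Shortened version of the text from the question.
text = (
    "Patients fall into stage D of the ABCD classification of the American "
    "College of Cardiology (ACC)/American Heart Association (AHA), and class "
    "III-IV of the New York Heart Association (NYHA) functional classification."
)

# The question's alternation: the third branch wins at the first "American".
original = re.compile(
    r"American Heart Association"
    r"|(?<=\W)American Heart Association"
    r"|American[-\s].*?\s*?Association"
)
# Negated character class: never cross "/", "," or ".".
negated = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")
# Tempered greedy token: never cross another "American" or "Association".
tempered = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

print(original.search(text).group(0))
print(negated.search(text).group(0))
print(tempered.search(text).group(0))
```

The first line printed is the long, unwanted match; the two restricted patterns both return only "American Heart Association".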

Regex: marking a pattern

I'm trying to mark the sentence that contains "manu", from the nearest \n\n before it to the nearest \n\n after it;
this is the text
\n\nHolds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.\n\n/ tou\n\nFor and on behalf\n\n
I wanted to mark just this
the following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
I tried this regex
\\n\\n(.+manu.+?)\\n\\n
but it's ignoring the nearest \n\n to my pattern and marks much more text than I want
Holds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
what am I missing?
The pattern starts at the leftmost \\n\\n, and the dot in .+ matches any character, including the characters of a later \\n\\n. So in this case it matches all the way to manu without stopping at the \\n\\n sequences in between.
You can use a pattern to match \\n\\n and make sure to not match it again before encountering manu
Then match until the first occurrence of \\n\\n after it, and capture the part that you want in a capture group.
\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n
Explanation
\\n\\n Match literally
( Capture group 1
(?:(?!\\n\\n).)+ Match any char asserting what is at the right is not \\n\\n
manu.+? Match manu followed by as few chars as possible (at least one)
) Close group 1
\\n\\n Match literally
If you also want the match when it is either followed by \\n\\n or the end of the string:
\\n\\n((?:(?!\\n\\n).)+manu.+?)(?:\\n\\n|$)
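The difference between the two patterns can be sketched in Python as follows. This assumes, as the patterns above do, that the text contains literal backslash-n character pairs rather than real newlines, which is why the sample is written as raw strings. The sample is a shortened version of the question's text and the names are illustrative.

```python
import re

# Shortened sample; raw strings keep "\n" as two literal characters.
text = (
    r"\n\nHolds Certificate No: EMS 96453\nand operates a system which "
    r"complies with the requirements of ISO for\n\nthe following scope: "
    r"marketing, developing,\n manufacturing, and supply of products."
    r"\n\n/ tou\n\nFor and on behalf\n\n"
)

# Pattern from the question: .+ runs across earlier \n\n separators.
greedy = re.compile(r"\\n\\n(.+manu.+?)\\n\\n")
# Pattern from the answer: the lookahead stops each step at the next \n\n.
tempered = re.compile(r"\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n")

too_much = greedy.search(text).group(1)
wanted = tempered.search(text).group(1)
print(too_much)
print(wanted)
```

Here too_much starts at "Holds Certificate", while wanted starts at "the following scope" and ends at "products.".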

Regular expression to capture a group of words followed by a group of formatted quantities

Given the content of a text file (below), I want to extract two values from each line that has the following pattern — capture groups indicated with [#]:
An unknown amount of leading whitespace…
[1] a group of words (each separated by a single space)…
two or more spaces…
[2] a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses…
two or more spaces…
a quantity following the same pattern as the former
an unknown amount of trailing whitespace.
The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.
I tried using the following regular expressions:
(\w+)\s{1}(\w+)*
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Example text file:
Micro-entity Balance Sheet as at 31 May 2019
Notes    2019    2018
£    £
Fixed Assets    2,046    1,369
Current Assets    53,790    24,799
Creditors: amounts falling due within one year    (23,146)    (6,106)
Net current assets (liabilities)    30,644    18,693
Total assets less current liabilities    32,690    20,062
Total net assets (liabilities)    32,690    20,062
Capital and reserves    32,690    20,062
For the year ending 31 May 2019 the company was entities to exemption under section 477 of the
Companies Act 2006 relating to small companies
® The members have not required the company to obtain an audit in accordance with section 476 of
the Companies Act 2006.
® The director acknowledge their responsibilities for complying with the requirements of the
Companies Act 2006 with respect to accounting records and the preparation of accounts.
® The accounts have been prepared in accordance with the micro-entity provisions and delivered in
accordance with the provisions applicable to companies subject to the small companies regime.
Approved by the Board on 20 December 2019
And signed on their behalf by:
Director
This document was delivered using electronic communications and authenticated in accordance with the
registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of
the Companies Act 2006.
Example valid matches:
"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"
You're so close, but so far. Why?
Your first regular expression…
(\w+)\s{1}(\w+)*
…is insufficient because the two capture groups do not take into account the spaces between words in the first case or the quantity formatting in the second case.
Your second regular expression…
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
…is better because it does capture groups of words, although far too eagerly.
Notes:
You do not need capture groups around the leading and trailing whitespace.
You do not need brackets around the space character. The bracket indicates a set of characters, but you only have one character in the set.
If you modify it slightly by removing the unnecessary capture groups…
.*? {2,}(.*?) {2,}(.*?) {2,}.*
…you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.
You could parse through these matches and discard unwanted ones with Python code. You don't need a regular expression, but you can be more precise with it.
Your regular expression captures unwanted data because you're unnecessarily matching any character with .*?, when you actually want to limit the matches to:
a group of words (each separated by a single space)
a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses
Only the lines you care about actually follow this pattern.
Consider this:
^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
The above regular expression improves the pattern matching in the following ways:
Explicitly match beginning of line ^ and end of line $ to prevent matching multiple lines.
Use a non-capturing group to match one or more words followed by a single space: (?:\S+ )+
Match non-whitespace characters with \S to capture "words" and punctuation (e.g. :).
Selectively match only a combination of one or more digits and commas optionally wrapped in parentheses with \(?[0-9,]+\)?
But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead, (?!Notes), to prevent matching the line that contains "Notes".
Final solution:
^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$
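To get from the final pattern to the Python dictionary the question asks for, something like the following sketch could work. The sample text and its column spacing are assumptions: the original file's exact whitespace is not shown. Note that the pattern needs at least three spaces between the label and the first amount, because the word group consumes one space itself.

```python
import re

# Final pattern from the answer; MULTILINE makes ^ and $ work per line.
ROW = re.compile(
    r"^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$",
    re.MULTILINE,
)

# Hypothetical excerpt with assumed column spacing (4 spaces).
sample = "\n".join([
    "Micro-entity Balance Sheet as at 31 May 2019",
    "Notes    2019    2018",
    "£    £",
    "Fixed Assets    2,046    1,369",
    "Current Assets    53,790    24,799",
    "Creditors: amounts falling due within one year    (23,146)    (6,106)",
    "Net current assets (liabilities)    30,644    18,693",
    "Approved by the Board on 20 December 2019",
])

# Group 1 keeps one trailing space from the word group, so strip it.
balances = {label.strip(): amount for label, amount in ROW.findall(sample)}
print(balances)
```

The amounts are kept as strings, including commas and parentheses; converting them to numbers would be a separate step.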

regex catastrophic backtracking ; extracting words starts with capital before the specific word

I'm relatively new to Python and am having trouble with regex.
I'm trying to extract a firm's name that appears before the word 'sale(s)' (or 'Sale(s)').
I found that the firm names in my text data all start with a capital letter (the remaining characters can be lowercase or uppercase letters, digits, '-' or ', for example 'Abc Def', 'ABC DEF', or just 'ABC' or 'Abc'),
and some of them take forms like 'Abc and Def' or 'Abc & Def'.
For example,
from the text,
;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived
approximately 21% ($4,782,852) of its consolidated revenues from
continuing operations from direct transactions with Kmart Corporation.
Sales of Computer products was good. However, Computer's Parts and Display
Segment sale has been decreasing.
I only want to extract 'Computer's Parts and Display Segment'.
So I tried to create a regex
((?:(?:[A-Z]+[a-zA-Z\-0-9\']*\.?\s?(?:and |\& )?)+)+?(?:[S|s]ales?\s))
where
[A-Z]+[a-zA-Z\-0-9\']*\.?\s? => this part finds words that start with a capital letter, with the remaining characters being a-z, A-Z, 0-9, -, ' or .
(?:and |\& )? => this part matches the word and or &
However, https://regex101.com/ reports catastrophic backtracking. I have read some related articles, but I still cannot find a way to solve this problem.
Could you help me?
Thanks!
Overview
Pointing out a few things in your pattern:
[a-zA-Z\-0-9\'] You don't need to escape ' here. Also, you can just place - at the start or end of the set and you won't need to escape it.
\& The ampersand character doesn't need to be escaped.
[S|s] Says to match either S, |, or s, thus you could potentially match |ales. The correct way to write this is [Ss].
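These character-class points are easy to confirm in Python. The test strings below are made up for illustration.

```python
import re

# Inside a character class, | is a literal pipe, not alternation.
pipe_matches = re.fullmatch(r"[S|s]ales", "|ales") is not None
pipe_rejected = re.fullmatch(r"[Ss]ales", "|ales") is None
print(pipe_matches, pipe_rejected)

# ' needs no escape in a class, - is literal when placed last,
# and & needs no escape anywhere.
tokens = re.findall(r"[\w'-]+|&", "Abc-1 & O'Neil")
print(tokens)
```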
Code
(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)
Results
Input
;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.
Output
Computer's Parts and Display Segment
Explanation
(?:(?:[A-Z][\w'-]*|and) +)+ Match this one or more times
(?:[A-Z][\w'-]*|and) Match either of the following
[A-Z][\w'-]* Match any uppercase ASCII character, followed by any number of word characters, apostrophes ' or hyphens -
and Match this literally
+ Match one or more spaces
(?=[sS]ales?) Positive lookahead ensuring any of the words sale, Sale, sales, or Sales follows
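In Python, the pattern above could be applied roughly like this. The text is a shortened, single-line version of the input, and the variable names are illustrative assumptions.

```python
import re

# Shortened single-line version of the input text.
text = (
    "In fiscal 2005, the Company derived revenues from direct transactions "
    "with Kmart Corporation. Sales of Computer products was good. However, "
    "Computer's Parts and Display Segment sale has been decreasing."
)

# Pattern from the answer: capitalized words (or "and"), each followed by
# spaces, with "sale"/"sales" required right after the run.
pattern = re.compile(r"(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)")

# The match includes the trailing space before "sale", so strip it.
names = [m.group(0).strip() for m in pattern.finditer(text)]
print(names)
```

Since the pattern uses a literal space rather than \s, a name that is split across a line break in the raw text would need the line breaks normalized first.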
