Regex: marking a pattern - python

I'm trying to mark a sentence that contains "manu", from its nearest preceding \n\n to its nearest following \n\n,
this is the text
\n\nHolds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.\n\n/ tou\n\nFor and on behalf\n\n
I wanted to mark just this
the following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
I tried this regex
\\n\\n(.+manu.+?)\\n\\n
but it's ignoring the nearest \n\n to my pattern and marks much more text than I want
Holds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
what am I missing?

The pattern matches the first \\n\\n from the left, and the .+ after it matches any character, including the characters of a later \\n\\n. So it reaches manu without considering what lies in between.
You can use a pattern to match \\n\\n and make sure to not match it again before encountering manu
Then match until the first occurrence of \\n\\n after it, and capture the part that you want in a capture group.
\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n
Explanation
\\n\\n Match literally
( Capture group 1
(?:(?!\\n\\n).)+ Match any char asserting what is at the right is not \\n\\n
manu.+? Match manu followed by as few chars as possible
) Close group 1
\\n\\n Match literally
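A minimal sketch of the difference in Python, using a shortened, made-up version of the text (the variable names are illustrative). The text contains literal backslash-n sequences rather than newline characters, so it is written as raw strings:

```python
import re

# Shortened stand-in for the question's text; \n here is a literal
# backslash followed by "n", not a newline character.
text = (r"\n\nHolds Certificate\nand operates for"
        r"\n\nthe following scope:\n manufacturing, and supply."
        r"\n\n/ tou\n\n")

greedy = r"\\n\\n(.+manu.+?)\\n\\n"
tempered = r"\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n"

greedy_match = re.search(greedy, text).group(1)
tempered_match = re.search(tempered, text).group(1)

print(greedy_match)    # starts at "Holds Certificate"
print(tempered_match)  # starts at "the following scope:"
```

The tempered version cannot start at the first \n\n because there is no manu before the next \n\n, so the engine moves on to the separator nearest to the word.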
If you also want the match when it is either followed by \\n\\n or the end of the string:
\\n\\n((?:(?!\\n\\n).)+manu.+?)(?:\\n\\n|$)

Related

Python regex A|B|C matches C even though B should match

I've been sitting on this problem for several hours now and I really don't know anymore...
Essentially, I have an A|B|C-type separated regex and for whatever reason C matches over B, even though the individual regexes should be tested from left to right and stopped in a non-greedy fashion (i.e. once a match is found, the other regexes are not tested anymore).
This is my code:
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"\
+ expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want the regex to find the "expansion" string. In my dataset, sometimes the text has the expansion string slightly edited, for example having articles or prepositions like "for" or "the" between the main nouns. This is why I first try to just match the string as is, then try to match it if it is after any non-word character (i.e. parentheses or, like in the example above, a whole lot of stuff as the space was omitted) and finally, I just go full wildcard to find the string by searching for the beginning and ending of the string with wildcards in between.
Either way, with the example above I would expect to get the following output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.
What gives?
The regex looks like this:
American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association
The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.
You can omit that second alternative, as the first alternative without any assertions has either already matched it, or the second part will also not match it if the first one did not match it.
As the pattern matches from left to right and encounters the first occurrence with American, the first and the second alternatives can not match American College of Cardiology.
Then the third alternation can match it, and due to the .*? it can stretch until the first occurrence of Association.
What you might do is for example exclude possible characters to match using a negated character class:
\bAmerican\b[^/,.]*\bAssociation\b
Or you might use a tempered greedy token approach to not allow specific words between the first and last part:
\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
So re.findall(r"(?<=\W)"+ expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). Your full regex, however, also matches "American [whatever random stuff here] Association" through its last alternative. Because the engine scans from left to right, it matches "American College of Cardiology (ACC)/American Heart Association" before it ever reaches the inner string "American Heart Association". E.g. if you deleted the first "American" in your string you would get the match you are looking for with your regex.
You need to be more restrictive with your regex to rule out situations like these.
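The leftmost-match behaviour described in these answers can be reproduced with a shortened version of the text. This is only a sketch; the variable names are illustrative and the restricted pattern is the negated-character-class suggestion from above:

```python
import re

text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA), and class III-IV")

# The question's pattern: the third alternative can start at the first
# "American", so it wins before the first alternative is ever reached.
original = (r"American Heart Association|(?<=\W)American Heart Association"
            r"|American[-\s].*?\s*?Association")

# Restricted pattern: nothing between the two words may be "/", "," or ".".
restricted = r"\bAmerican\b[^/,.]*\bAssociation\b"

original_match = re.search(original, text).group(0)
restricted_match = re.search(restricted, text).group(0)

print(original_match)
print(restricted_match)
```

The order of the alternatives only matters among matches that start at the same position; an alternative that can start further to the left is always preferred.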

Regular expression to capture a group of words followed by a group of formatted quantities

Given the content of a text file (below), I want to extract two values from each line that has the following pattern — capture groups indicated with [#]:
An unknown amount of leading whitespace…
[1] a group of words (each separated by a single space)…
two or more spaces…
[2] a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses…
two or more spaces…
a quantity following the same pattern as the former
an unknown amount of trailing whitespace.
The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.
I tried using the following regular expressions:
(\w+)\s{1}(\w+)*
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Example text file:
Micro-entity Balance Sheet as at 31 May 2019
Notes 2019 2018
£ £
Fixed Assets 2,046 1,369
Current Assets 53,790 24,799
Creditors: amounts falling due within one year (23,146) (6,106)
Net current assets (liabilities) 30,644 18,693
Total assets less current liabilities 32,690 20,062
Total net assets (liabilities) 32,690 20,062
Capital and reserves 32,690 20,062
For the year ending 31 May 2019 the company was entities to exemption under section 477 of the
Companies Act 2006 relating to small companies
® The members have not required the company to obtain an audit in accordance with section 476 of
the Companies Act 2006.
® The director acknowledge their responsibilities for complying with the requirements of the
Companies Act 2006 with respect to accounting records and the preparation of accounts.
® The accounts have been prepared in accordance with the micro-entity provisions and delivered in
accordance with the provisions applicable to companies subject to the small companies regime.
Approved by the Board on 20 December 2019
And signed on their behalf by:
Director
This document was delivered using electronic communications and authenticated in accordance with the
registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of
the Companies Act 2006.
Example valid matches:
"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"
You're so close, but so far. Why?
Your first regular expression…
(\w+)\s{1}(\w+)*
…is insufficient because the two capture groups do not take into account the spaces between words in the first case or the quantity formatting in the second case.
Your second regular expression…
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
…is better because it effectively captures groups of words, however eagerly.
Notes:
You do not need capture groups around the leading and trailing whitespace.
You do not need brackets around the space character. The bracket indicates a set of characters, but you only have one character in the set.
If you modify it slightly by removing the unnecessary capture groups…
.*? {2,}(.*?) {2,}(.*?) {2,}.*
…you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.
You could parse through these matches and discard unwanted ones with Python code. You don't need a regular expression, but you can be more precise with it.
Your regular expression captures unwanted data because you're unnecessarily matching any character with .*?, when you actually want to limit the matches to:
a group of words (each separated by a single space)
a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses
Only the lines you care about actually follow this pattern.
Consider this:
^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View # Regex101.com
The above regular expression improves the pattern matching in the following ways:
Explicitly match beginning of line ^ and end of line $ to prevent matching multiple lines.
Use a non-capturing group to match one or more words followed by a single space: (?:\S+ )+
Match non-whitespace characters with \S to capture "words" and punctuation (e.g. :).
Selectively match only a combination of one or more digits and commas optionally wrapped in parentheses with \(?[0-9,]+\)?
But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead… (?!Notes)…to prevent matching the line that contains "Notes".
Final solution:
^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View # Regex101.com
You may find it educational to view it as a syntax diagram:
View # RegExper.com
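To get from the pattern to the dictionary the question asks for, something like the following sketch could work. The sample lines and their column spacing are assumptions (the real file is expected to separate columns with runs of spaces), and the names sample, pattern and balance are illustrative:

```python
import re

# Assumed layout: label, then 2+ spaces, then the 2019 and 2018 columns.
sample = "\n".join([
    "Micro-entity Balance Sheet as at 31 May 2019",
    "                                        Notes     2019     2018",
    "                                                  £        £",
    "Fixed Assets                                      2,046    1,369",
    "Creditors: amounts falling due within one year    (23,146) (6,106)",
    "Net current assets (liabilities)                  30,644   18,693",
    "Approved by the Board on 20 December 2019",
])

# re.MULTILINE makes ^ and $ match at every line, not just the whole text.
pattern = re.compile(
    r"^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE
)

# Group 1 keeps the single space after the last word, so strip it.
balance = {m.group(1).strip(): m.group(2) for m in pattern.finditer(sample)}
print(balance)
```

Lines without a quantity column, such as the title and the "Approved by" line, fail the two-or-more-spaces requirement and are skipped.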

How to extract text at newline using regex in python?

I am having trouble trying to extract text/values on a newline using regex.
I'm trying to get the ("REQUIRED QUALIFICATIONS:") values.
If I use:
pattern = re.compile(r"JOB RESPONSIBILITIES: .*")
matches = pattern.finditer(gh)
The output would be =
<_sre.SRE_Match object; span=(161, 227), match='JOB DESCRIPTION:
Public outreach and strengthen>
BUT if i type:-
pattern = re.compile(r"REQUIRED QUALIFICATIONS: .*")
I will get =
match='REQUIRED QUALIFICATIONS: \r'>
Here is the text im trying to extract :
JOB RESPONSIBILITIES: \r\n- Working with the Country Director to
provide environmental information\r\nto the general public via regular
electronic communications and serving\r\nas the primary local contact
to Armenian NGOs and businesses and the\r\nArmenian offices of
international organizations and agencies;\r\n- Helping to organize and
prepare CENN seminars/ workshops;\r\n- Participating in defining the
strategy and policy of CENN in Armenia,\r\nthe Caucasus region and
abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally
related field, or 5 years relevant\r\nexperience;\r\n- Oral and
written fluency in Armenian, Russian and English;\r\n- Knowledge/
experience of working with environmental issues specific to\r\nArmenia
is a plus.\r\nREMUNERATION:
how do i solve this problem? Thanks in advance.
You can use a positive lookbehind: (?<=REQUIRED QUALIFICATIONS:)
code:
import re
text = """
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
REQUIRED QUALIFICATIONS:
- Degree in environmentally related field, or 5 years relevant
experience;
- Oral and written fluency in Armenian, Russian and English;
- Knowledge/ experience of working with environmental issues specific to
Armenia is a plus.
REMUNERATION:
"""
pattern =r'(?<=REQUIRED QUALIFICATIONS:)(\s.+)?REMUNERATION'
print(re.findall(pattern,text,re.DOTALL))
output:
['\n\n- Degree in environmentally related field, or 5 years relevant\n\nexperience;\n\n- Oral and written fluency in Armenian, Russian and English;\n\n- Knowledge/ experience of working with environmental issues specific to\n\nArmenia is a plus.\n\n']
regex information:
Positive Lookbehind (?<=REQUIRED QUALIFICATIONS:)
Asserts that the text before the current position matches the regex below:
REQUIRED QUALIFICATIONS: matches the characters REQUIRED QUALIFICATIONS: literally (case sensitive)
1st Capturing Group (\s.+)?
? Quantifier: matches between zero and one times, as many times as possible, giving back as needed (greedy)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
. matches any character (including line breaks, because re.DOTALL is set)
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
You may try this regex, which is the same as yours except that it includes the inline modifier (?s) (the single-line or dot-all modifier). It makes the dot (.) match every character, including line breaks ([\n\r]), so multi-line text can be handled like a single-line string.
(?s)JOB RESPONSIBILITIES: .*
I used the re.match() function and took the full match from group(0), as follows:
ss="""JOB RESPONSIBILITIES: \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years relevant\r\nexperience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n- Knowledge/ experience of working with environmental issues specific to\r\nArmenia is a plus.\r\nREMUNERATION:"""
pattern= re.compile(r"(?s)JOB RESPONSIBILITIES: .*")
print(pattern.match(ss).group(0))
output result is
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
REQUIRED QUALIFICATIONS:
Additionally, you can set the dot-all (single-line) modifier through the re module's flag re.S, as follows:
pattern= re.compile(r"JOB RESPONSIBILITIES: .*",re.S)
For more information, please refer to re — Regular expression operations
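As a rough illustration of both answers, the sketch below uses a shortened, made-up version of the posting (the names text, heading_only and section are illustrative). It first reproduces the question's result and then captures only the section between the two headings:

```python
import re

text = ("JOB RESPONSIBILITIES: \r\n- Helping to organize seminars;\r\n"
        "REQUIRED QUALIFICATIONS: \r\n- Degree in a related field;\r\n"
        "- Fluency in English.\r\nREMUNERATION:")

# Without re.S the dot stops at "\n", so only the heading line is matched
# (the trailing "\r" is still a character the dot accepts).
heading_only = re.search(r"REQUIRED QUALIFICATIONS: .*", text).group(0)

# With re.S the dot also crosses line breaks; the lazy .*? stops at the
# next heading, and \s* trims the surrounding whitespace.
section = re.search(r"REQUIRED QUALIFICATIONS:\s*(.*?)\s*REMUNERATION:",
                    text, re.S).group(1)

print(repr(heading_only))
print(repr(section))
```

Using a lazy quantifier plus the next heading as a stop marker avoids running to the end of the text, which a bare .* would do under re.S.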

python regex negative lookahead method

I'm extracting firm names from text data (10-K statement data).
I first tried the NLTK StanfordTagger and extracted all the words tagged as organization. However, it quite often failed to recall all the firm names, and as I'm applying the tagger to every single related sentence, it took a very long time.
So I'm trying to extract all the words starting with a capital letter (or words made up entirely of capital letters).
I found the regex below helpful.
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+
However, It cannot distinguish the name of segment from the name of firm.
For example,
sentence :
The Company's customers include, among others, Conner Peripherals Inc.("Conner"),
Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.
I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, Applieds, but not 'Silicon Systems' since it is the name of segment.
So, I tried using
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)
However, it still extracts 'Silicon Systems'.
Could you help me solve this problem?
(Or do you have any idea of how to extract only the firm's name from the text data?)
Thanks a lot!!!
You need to capture the whole run of consecutive words, and mark the pattern for an individual capitalised word as non-capturing (?:...) so that the outer group captures the consecutive words together!
>>> re.findall("((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']
The NLTK approach, or any machine learning, seems to be a better approach here. I can only explain what the difficulty and current issue with the regex approach are.
The problem is that the matches expected can contain space separated phrases, and you want to avoid matching a certain phrase ending with segment. Even if you correct the negative lookahead as (?!\s*[Ss]egment), and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match.
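The backtracking described here can be observed on a shortened sentence. This is a sketch with illustrative variable names; the two patterns are the question's attempt and the linear variant mentioned above:

```python
import re

s = "equipment in the Silicon Systems segment to the industry"

# The question's attempt: \s? can give back the trailing space, after which
# the lookahead sees " segment" (with a space) and no longer rejects it.
naive = r"(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)"

# The corrected lookahead: the engine backtracks by dropping " Systems",
# and "Silicon" alone is not followed by whitespace plus "segment".
linear = (r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
          r"(?!\s+[sS]egment)")

naive_hits = re.findall(naive, s)
linear_hits = re.findall(linear, s)

print(naive_hits)
print(linear_hits)
```

In both cases the lookahead is satisfied by a shorter match rather than causing the whole phrase to be skipped, which is why matching and discarding the unwanted phrase is suggested instead.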
What you might try to do is to match all these entities and discard after matching, and only keep those entities in other contexts by capturing them into Group 1.
See the sample regex for this:
\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)
Since it is unwieldy, you should think of building it from blocks, dynamically:
import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
matches = list(filter(None, re.findall(rx, s)))  # list() is needed on Python 3, where filter() returns an iterator
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']
So,
\b - matches a word boundary
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
(?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
\s+ - 1+ whitespaces
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
\b - trailing word boundary
\.? - an optional .
Then, this block is used to build
{0}\s+[sS]egment\b - the block we defined before followed with
\s+ - 1+ whitespaces
[sS]egment\b - either segment or Segment whole words
| - or
({0}) - Group 1 (what re.findall actually returns): the block we defined before.
filter(None, ...) filters out the empty items in the final list. In Python 2.x it returns a list directly; in Python 3.x wrap it as list(filter(None, re.findall(rx, s))).

Regex which matches the longer string in an OR

Motivation
I'm parsing addresses and need to get the address and the country in separated matches, but the countries might have aliases, e.g.:
UK == United Kingdom,
US == USA == United States,
Korea == South Korea,
and so on...
Explanation
So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:
germany|us|france|chile
But the problem is with multi-word country names and their shorter versions, like:
Republic of Moldova and Moldova
Using this as example, we have the string:
'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
What I want to get from this:
'Somewhere in Moldova, bla bla, more bla, 12313'
'Republic of Moldova'
But this is what I get:
'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'
Regex
As there are several cases, here is what I'm using so far:
^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$
As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:
(.*[\d\-]+.*|,.*[:/].*)?
Also, sometimes the country name comes enclosed in parenthesis, so I have \(? and \)? around the second match group, and all the countries go inside it:
(republic of moldova|moldova|...)
Question
The thing is, when there is an entry which is a subset of a bigger one, the shorter is chosen over the longer, and the remainder stays in the base_address string.
Is there a way to tell the regex to choose the longest possible match when two values match?
Edit
I'm using Python with built in re module
As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider another example:
'Department of Chemistry, National University of Singapore, 4512436 Singapore'
Matches:
'Department of Chemistry, National University of'
'Singapore'
Here it matches too soon now.
Your problem is greediness.
The .* right at the beginning tries to match as much as possible. That is everything until the end of the string. But then the rest of your pattern fails. So the engine backtracks, and discards the last character matched with .* and tries the rest of the pattern again (which still fails). The engine will repeat this process (fail match, backtrack/discard one character, try again) until it can finally match with the rest of the pattern. The first time this occurs is when .* matches everything up to Moldova (so .* is still consuming Republic of). And then the alternation (which still cannot match republic of moldova) will gladly match moldova and return that as the result.
The simplest solution is to make the repetition ungreedy:
^(.*?)...
Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
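A small sketch of the two behaviours, using the question's example with a simplified pattern (the optional trailing group is left out here, and the variable names are illustrative):

```python
import re

address = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
countries = r"(republic of moldova|moldova)"

# Greedy .* grabs everything and gives back only as much as it must,
# so the short alias at the very end is the first thing that fits.
greedy = re.match(r"^(.*),? \(?" + countries + r"\)?$", address, re.I)

# Lazy .*? grows one character at a time, so the longer name is reached
# before the engine ever tries the position of the short one.
lazy = re.match(r"^(.*?),? \(?" + countries + r"\)?$", address, re.I)

print(greedy.groups())
print(lazy.groups())
```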
EDIT:
There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem that countries in earlier parts of the string might be matched. What you can do instead, is to use lookarounds that ensure that there are no word characters (letters, digits, underscore) before or after the country. That means, a country word is only matched, if it is surrounded by commas or either end of the string:
^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$
Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:
There is no word character before the mandatory space preceding the country.
There is no word character after the country that is separated by nothing but spaces.
Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
EDIT 2:
As abarnert mentioned in a comment, how about not using a regex-only solution?
You could split the string on ",", then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple matches, then at least you can detect the ambiguity and deal with it properly.
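One possible shape for this split-and-compare idea is sketched below. The COUNTRIES set and the split_country helper are hypothetical names introduced for illustration, and the comparison is a plain case-insensitive equality check:

```python
# Hypothetical alias table; extend it with the names you expect.
COUNTRIES = {"moldova", "republic of moldova", "singapore"}


def split_country(address):
    """Return (remaining_parts, country_parts) for a comma-separated address."""
    parts = [p.strip() for p in address.split(",")]
    found = [p for p in parts if p.lower() in COUNTRIES]
    rest = [p for p in parts if p.lower() not in COUNTRIES]
    return rest, found


rest, found = split_country(
    "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
)
print(rest, found)

_, found_sg = split_country(
    "Department of Chemistry, National University of Singapore, "
    "4512436 Singapore"
)
print(found_sg)
```

A component such as "4512436 Singapore" is not an exact match, so found_sg comes back empty here; handling a postal code in front of the country would need an extra rule on top of this sketch.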
Sort all the alternatives: build the regex programmatically from an array of names sorted from longest to shortest. Then put the whole alternation in an atomic group (the PCRE engine has them; I don't know whether Python's re engine does). Because of the atomic group, the regex engine never backtracks to try another alternative inside it, and because the alternatives are sorted, the match will always be the longest one.
Tada.
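The sorting part of this suggestion might look like the sketch below. The list of names is illustrative, and the sketch relies on ordering alone; it does not use an atomic group:

```python
import re

names = ["us", "usa", "moldova", "republic of moldova"]  # illustrative list

# Alternatives are tried left to right, so the order decides which of two
# names that start at the same position is taken.
unsorted_rx = re.compile("|".join(map(re.escape, names)), re.I)

by_length = sorted(names, key=len, reverse=True)
sorted_rx = re.compile("|".join(map(re.escape, by_length)), re.I)

short_hit = unsorted_rx.match("USA").group(0)
long_hit = sorted_rx.match("USA").group(0)

print(short_hit, long_hit)
```

Without an atomic group the engine can still fall back to a shorter alternative if the rest of a larger pattern fails after the longer one, so this only fixes the preference order.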
