python regex sub not working due to escape sequence - python

I have this kind of text. -> Roberto is an insurance agent who sells two types of policies: a $$\$$50,000$$ policy and a $$\$$100,000$$ policy. Last month, his goal was to sell at least 57 insurance policies. While he did not meet his goal, the total value of the policies he sold was over $$\$$3,000,000$$. Which of the following systems of inequalities describes $$x$$, the possible number of $$\$$50,000$$ policies, and $$y$$, the possible number of $$\$$100,000$$ policies, that Roberto sold last month?
I want to replace expressions containing dollar signs, such as $$\$$50,000$$. Removing things such as $$y$$ worked out quite well, but the expressions that contain an escape sequence are not replaced.
This is the code I used.
re.sub("$$\$$.*?$$", "", text)
This didn't work, and I found out that \ is an escape character, so it should be written as \\. So I changed the expression as below.
re.sub("$$\\$$.*?$$", "", text)
However, this again didn't work. What am I doing wrong ? Thanks a lot in advance ...

The character $ is a regex metacharacter, and so will need to be escaped if intended to refer to a literal $:
text = """Roberto is an insurance agent who sells two types of policies: a $$\$$50,000$$ policy and a $$\$$100,000$$ policy. Last month, his goal was to sell at least 57 insurance policies. While he did not meet his goal, the total value of the policies he sold was over $$\$$3,000,000$$. Which of the following systems of inequalities describes $$x$$, the possible number of $$\$$50,000$$ policies, and $$y$$, the possible number of $$\$$100,000$$ policies, that Roberto sold last month?"""
output = re.sub(r'\$\$(?:\\\$\$)?.*?\$\$', '', text)
print(output)
The above pattern makes the \$$ part optional, so it covers both forms in the text: $$\$$50,000$$ and $$x$$.
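To see why the attempt in the question leaves the text unchanged, here is a minimal sketch that compares the unescaped pattern with the escaped one. The shortened sample string and the variable names are assumptions made for illustration.

```python
import re

# Shortened sample (an assumption); the raw string keeps the backslashes as-is.
text = r"a $$\$$50,000$$ policy and $$x$$ policies"

# Unescaped: each bare $ is an end-of-string anchor, so the literal \$ that
# follows the anchors can never match and the text comes back unchanged.
unescaped = re.sub("$$\\$$.*?$$", "", text)

# Escaped: \$ matches a literal dollar sign, and the optional group
# covers the backslash-dollar form.
escaped = re.sub(r"\$\$(?:\\\$\$)?.*?\$\$", "", text)

print(unescaped == text)  # True
print(repr(escaped))      # 'a  policy and  policies'
```

Note that removing the expressions leaves the surrounding spaces in place, so a follow-up cleanup of double spaces may be wanted.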

Related

Regex: marking a pattern

I'm trying to mark a sentence that contains "manu", from its nearest preceding \n\n to its nearest following \n\n.
This is the text:
\n\nHolds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.\n\n/ tou\n\nFor and on behalf\n\n
I wanted to mark just this
the following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
I tried this regex
\\n\\n(.+manu.+?)\\n\\n
but it's ignoring the nearest \n\n to my pattern and marks much more text than I want
Holds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
what am I missing?
The pattern starts at the left by matching \\n\\n, followed by .+, and the dot matches any character, including the characters of another \\n\\n. So the match starts at the first \\n\\n in the text and runs on to manu, regardless of how many \\n\\n separators lie in between.
You can use a pattern that matches \\n\\n and makes sure not to match it again before encountering manu.
Then match until the first occurrence of \\n\\n after it, and capture the part that you want in a capture group.
\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n
Explanation
\\n\\n Match literally
( Capture group 1
(?:(?!\\n\\n).)+ Match any character 1+ times, asserting at each position that what is directly to the right is not \\n\\n
manu.+? Match manu followed by as few characters as possible
) Close group 1
\\n\\n Match literally
If you also want the match when it is either followed by \\n\\n or the end of the string:
\\n\\n((?:(?!\\n\\n).)+manu.+?)(?:\\n\\n|$)
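As a quick check in Python, the sketch below runs the question's pattern and the first pattern above against a shortened sample. The sample text is an assumption; it is written as raw strings so that each \n is a backslash followed by n, which is what \\n in the patterns matches.

```python
import re

# Shortened sample (an assumption). Raw strings: \n here is backslash + n.
text = (r"\n\nHolds Certificate No: EMS 96453\nand operates a system\n\n"
        r"the following scope: marketing, developing,\n manufacturing, and supply.\n\n"
        r"/ tou\n\nFor and on behalf\n\n")

# The question's pattern: .+ runs straight across the intermediate \n\n.
naive = re.search(r"\\n\\n(.+manu.+?)\\n\\n", text)

# The answer's pattern: the lookahead stops the scan at every \n\n.
tempered = re.search(r"\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n", text)

print(naive.group(1))     # starts at "Holds Certificate ..."
print(tempered.group(1))  # starts at "the following scope ..."
```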

Regular expression to capture a group of words followed by a group of formatted quantities

Given the content of a text file (below), I want to extract two values from each line that has the following pattern — capture groups indicated with [#]:
An unknown amount of leading whitespace…
[1] a group of words (each separated by a single space)…
two or more spaces…
[2] a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses…
two or more spaces…
a quantity following the same pattern as the former
an unknown amount of trailing whitespace.
The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.
I tried using the following regular expressions:
(\w+)\s{1}(\w+)*
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Example text file:
Micro-entity Balance Sheet as at 31 May 2019
Notes    2019    2018
£    £
Fixed Assets    2,046    1,369
Current Assets    53,790    24,799
Creditors: amounts falling due within one year    (23,146)    (6,106)
Net current assets (liabilities)    30,644    18,693
Total assets less current liabilities    32,690    20,062
Total net assets (liabilities)    32,690    20,062
Capital and reserves    32,690    20,062
For the year ending 31 May 2019 the company was entities to exemption under section 477 of the
Companies Act 2006 relating to small companies
® The members have not required the company to obtain an audit in accordance with section 476 of
the Companies Act 2006.
® The director acknowledge their responsibilities for complying with the requirements of the
Companies Act 2006 with respect to accounting records and the preparation of accounts.
® The accounts have been prepared in accordance with the micro-entity provisions and delivered in
accordance with the provisions applicable to companies subject to the small companies regime.
Approved by the Board on 20 December 2019
And signed on their behalf by:
Director
This document was delivered using electronic communications and authenticated in accordance with the
registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of
the Companies Act 2006.
Example valid matches:
"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"
You're so close, but so far. Why?
Your first regular expression…
(\w+)\s{1}(\w+)*
…is insufficient because the two capture groups do not take into account the spaces between words in the first case or the quantity formatting in the second case.
Your second regular expression…
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
…is better because it effectively captures groups of words, however eagerly.
Notes:
You do not need capture groups around the leading and trailing whitespace.
You do not need brackets around the space character. The bracket indicates a set of characters, but you only have one character in the set.
If you modify it slightly by removing the unnecessary capture groups…
.*? {2,}(.*?) {2,}(.*?) {2,}.*
…you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.
You could parse through these matches and discard unwanted ones with Python code. You don't need a regular expression, but you can be more precise with it.
Your regular expression captures unwanted data because you're unnecessarily matching any character with .*?, when you actually want to limit the matches to:
a group of words (each separated by a single space)
a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses
Only the lines you care about actually follow this pattern.
Consider this:
^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View on Regex101.com
The above regular expression improves the pattern matching in the following ways:
Explicitly match beginning of line ^ and end of line $ to prevent matching multiple lines.
Use a non-capturing group to match one or more words followed by a single space: (?:\S+ )+
Match non-whitespace characters with \S to capture "words" and punctuation (e.g. :).
Selectively match only a combination of one or more digits and commas optionally wrapped in parentheses with \(?[0-9,]+\)?
But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead, (?!Notes), to prevent matching the line that contains "Notes".
Final solution:
^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View on Regex101.com
You may find it educational to view it as a syntax diagram:
View on RegExper.com
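For the stated goal of a Python dictionary, here is a minimal sketch. It uses a variant of the pattern rather than the exact regex above: the label is matched as words joined by single spaces, and both quantities must be preceded by two or more spaces. The sample lines, the column spacing, the names ROW and balances, and the helper to_int (which treats parentheses as a negative amount) are all assumptions made for illustration.

```python
import re

# Sample lines; columns are assumed to be separated by two or more spaces.
sample = """\
Micro-entity Balance Sheet as at 31 May 2019
                Notes    2019    2018
                         £       £
Fixed Assets    2,046    1,369
Creditors: amounts falling due within one year    (23,146)    (6,106)
Net current assets (liabilities)    30,644    18,693
Companies Act 2006 relating to small companies
"""

# Variant pattern (an assumption): label, 2+ spaces, quantity, 2+ spaces, quantity.
ROW = re.compile(
    r"^ *((?!Notes)\S+(?: \S+)*) {2,}(\(?[0-9,]+\)?) {2,}\(?[0-9,]+\)? *$",
    re.MULTILINE,
)

# Map each label to the quantity in the first numeric column.
balances = {m.group(1): m.group(2) for m in ROW.finditer(sample)}


def to_int(quantity):
    """Hypothetical helper: '(23,146)' -> -23146, '2,046' -> 2046."""
    digits = quantity.strip("()").replace(",", "")
    return -int(digits) if quantity.startswith("(") else int(digits)


for label, quantity in balances.items():
    print(label, "->", quantity, "->", to_int(quantity))
```

Keeping the raw strings in the dictionary and converting separately leaves the original formatting available if it is needed later.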

How to extract text at newline using regex in python?

I am having trouble extracting text/values that follow a newline using regex.
I'm trying to get the values under "REQUIRED QUALIFICATIONS:".
If I use:
pattern = re.compile(r"JOB RESPONSIBILITIES: .*")
matches = pattern.finditer(gh)
the output is:
<_sre.SRE_Match object; span=(161, 227), match='JOB DESCRIPTION:
Public outreach and strengthen>
But if I use:
pattern = re.compile(r"REQUIRED QUALIFICATIONS: .*")
I get:
match='REQUIRED QUALIFICATIONS: \r'>
Here is the text I'm trying to extract from:
JOB RESPONSIBILITIES: \r\n- Working with the Country Director to
provide environmental information\r\nto the general public via regular
electronic communications and serving\r\nas the primary local contact
to Armenian NGOs and businesses and the\r\nArmenian offices of
international organizations and agencies;\r\n- Helping to organize and
prepare CENN seminars/ workshops;\r\n- Participating in defining the
strategy and policy of CENN in Armenia,\r\nthe Caucasus region and
abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally
related field, or 5 years relevant\r\nexperience;\r\n- Oral and
written fluency in Armenian, Russian and English;\r\n- Knowledge/
experience of working with environmental issues specific to\r\nArmenia
is a plus.\r\nREMUNERATION:
How do I solve this problem? Thanks in advance.
You can use a positive lookbehind: (?<=REQUIRED QUALIFICATIONS:)
Code:
import re
text = """
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
REQUIRED QUALIFICATIONS:
- Degree in environmentally related field, or 5 years relevant
experience;
- Oral and written fluency in Armenian, Russian and English;
- Knowledge/ experience of working with environmental issues specific to
Armenia is a plus.
REMUNERATION:
"""
pattern =r'(?<=REQUIRED QUALIFICATIONS:)(\s.+)?REMUNERATION'
print(re.findall(pattern,text,re.DOTALL))
output:
['\n\n- Degree in environmentally related field, or 5 years relevant\n\nexperience;\n\n- Oral and written fluency in Armenian, Russian and English;\n\n- Knowledge/ experience of working with environmental issues specific to\n\nArmenia is a plus.\n\n']
Regex information:
Positive lookbehind (?<=REQUIRED QUALIFICATIONS:): asserts that the text REQUIRED QUALIFICATIONS: (matched literally, case sensitive) comes immediately before the current position.
1st capturing group (\s.+)?
? quantifier: matches between zero and one times, as many times as possible, giving back as needed (greedy)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
. matches any character (including newlines here, because re.DOTALL is set)
+ quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
You may try this regex, which is the same as yours except that it includes the inline modifier (?s). This is the single-line, or dot-all, modifier: it makes the dot (.) match every character, including the newline \n, so a multi-line text can be handled like a single-line string.
(?s)JOB RESPONSIBILITIES: .*
I used re.match() and took the full match from group(0), as follows:
ss="""JOB RESPONSIBILITIES: \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years relevant\r\nexperience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n- Knowledge/ experience of working with environmental issues specific to\r\nArmenia is a plus.\r\nREMUNERATION:"""
pattern= re.compile(r"(?s)JOB RESPONSIBILITIES: .*")
print(pattern.match(ss).group(0))
The output is:
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
REQUIRED QUALIFICATIONS:
Additionally, you can set the dot-all (single-line) modifier through the flags argument of the re module's functions, using re.S, as follows:
pattern= re.compile(r"JOB RESPONSIBILITIES: .*",re.S)
For more information, please refer to re — Regular expression operations
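Putting the two answers together, the following sketch shows why the original pattern returned only '\r' and how re.S with a lazy group can capture just the section between two headings. The shortened text and the variable names are assumptions; the pattern with the lazy group is an illustrative variant, not taken from either answer.

```python
import re

# Shortened sample with Windows line endings (an assumption).
ss = ("JOB RESPONSIBILITIES: \r\n- Helping to organize seminars.\r\n"
      "REQUIRED QUALIFICATIONS: \r\n- Degree in a related field;\r\n"
      "- Fluency in Armenian, Russian and English.\r\nREMUNERATION:")

# Without DOTALL the dot matches "\r" but stops at "\n".
short = re.search(r"REQUIRED QUALIFICATIONS: .*", ss).group(0)

# With re.S the dot also matches "\n"; the lazy group stops at the next heading.
section = re.search(
    r"REQUIRED QUALIFICATIONS:\s*(.*?)\s*REMUNERATION:", ss, re.S
).group(1)

print(repr(short))
print(section)
```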

regex catastrophic backtracking ; extracting words starts with capital before the specific word

I'm relatively new to the Python world and am having trouble with regex.
I'm trying to extract the firm's name before the word 'sale(s)' (or 'Sale(s)').
I found that the firm names in my text data all start with a capital letter (the remaining characters can be lowercase or uppercase letters, digits, '-' or ', for example 'Abc Def', 'ABC DEF', or just 'ABC' or 'Abc'),
and some of them take forms like 'Abc and Def' or 'Abc & Def'.
For example,
from the text,
;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived
approximately 21% ($4,782,852) of its consolidated revenues from
continuing operations from direct transactions with Kmart Corporation.
Sales of Computer products was good. However, Computer's Parts and Display
Segment sale has been decreasing.
I only want to extract 'Computer's Parts and Display Segment'.
So I tried to create a regex
((?:(?:[A-Z]+[a-zA-Z\-0-9\']*\.?\s?(?:and |\& )?)+)+?(?:[S|s]ales?\s))
[A-Z]+[a-zA-Z\-0-9\']*\.?\s => this part finds words that start with a capital letter, where the remaining characters are a-z, A-Z, 0-9, - or ', optionally followed by a .
(?:and |\& )? => this part matches an optional "and " or "& " between words
However, https://regex101.com/ reports catastrophic backtracking. I read some related articles, but I still cannot find a way to solve this problem.
Could you help me?
Thanks!
Overview
Pointing out a few things in your pattern:
[a-zA-Z\-0-9\'] You don't need to escape ' here. Also, you can just place - at the start or end of the set and you won't need to escape it.
\& The ampersand character doesn't need to be escaped.
[S|s] Says to match either S, |, or s, thus you could potentially match |ales. The correct way to write this is [Ss].
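The point about [S|s] can be checked directly. This is a small sketch with illustrative variable names.

```python
import re

# Inside a character class, | is a literal character, not alternation.
pipe_matches = bool(re.fullmatch(r"[S|s]ales?", "|ales"))
pipe_rejected = not re.fullmatch(r"[Ss]ales?", "|ales")
word_matches = bool(re.fullmatch(r"[Ss]ales?", "Sale"))

print(pipe_matches, pipe_rejected, word_matches)  # True True True
```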
Code
(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)
Results
Input
;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.
Output
Computer's Parts and Display Segment
Explanation
(?:(?:[A-Z][\w'-]*|and) +)+ Match this one or more times
(?:[A-Z][\w'-]*|and) Match either of the following
[A-Z][\w'-]* Match any uppercase ASCII character, followed by any number of word characters, apostrophes ' or hyphens -
and Match this literally
+ Match one or more spaces
(?=[sS]ales?) Positive lookahead ensuring any of the words sale, Sale, sales, or Sales follows
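In Python, the pattern can be applied with re.findall, as in the sketch below. The shortened input and the variable names are assumptions. Because the pattern consumes the spaces before the lookahead, each match carries a trailing space, which is stripped here.

```python
import re

# Shortened input (an assumption), kept on one logical line.
text = ("Sales of Computer products was good. However, Computer's Parts and "
        "Display Segment sale has been decreasing.")

pattern = re.compile(r"(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)")

# No capture groups, so findall returns the full matches.
names = [match.strip() for match in pattern.findall(text)]
print(names)  # ["Computer's Parts and Display Segment"]
```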

python regex negative lookahead method

I'm now extracting firm names from text data (10-K statement data).
I first tried using the NLTK StanfordTagger and extracted all the words tagged as organization. However, it quite often failed to recall all the firm names, and as I'm applying the tagger to every single related sentence, it took a very long time.
So I'm trying to extract all the words that start with a capital letter (or that consist entirely of capital letters).
I found the regex below helpful.
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+
However, it cannot distinguish the name of a segment from the name of a firm.
For example,
sentence :
The Company's customers include, among others, Conner Peripherals Inc.("Conner"),
Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.
I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, Applieds, but not 'Silicon Systems' since it is the name of segment.
So, I tried using
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)
However, it still extracts 'Silicon Systems'.
Could you help me solve this problem?
(Or do you have any idea of how to extract only the firm's name from the text data?)
Thanks a lot!!!
You need to capture each run of consecutive capitalized words as a whole: make the group for an individual capitalized word non-capturing, (?:...), and wrap the repetition in a capturing group.
>>> re.findall(r"((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']
The NLTK approach, or any machine learning, seems to be a better approach here. I can only explain what the difficulty and current issue with the regex approach are.
The problem is that the matches expected can contain space separated phrases, and you want to avoid matching a certain phrase ending with segment. Even if you correct the negative lookahead as (?!\s*[Ss]egment), and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match.
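The partial match described above can be reproduced with a short sketch. The shortened sentence and the variable names are assumptions.

```python
import re

# Shortened sentence (an assumption) containing only the segment name.
s = ("derived from sales of manufacturing equipment in the "
     "Silicon Systems segment to the global industry.")

lookahead_rx = (r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
                r"(?!\s+[sS]egment)")

# The lookahead rejects "Silicon Systems", so the regex backtracks
# and settles for the shorter "Silicon".
partial = re.findall(lookahead_rx, s)
print(partial)  # ['Silicon']
```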
What you might try is to match these entities in the unwanted context (followed by segment) and discard them, and keep only the entities found in other contexts by capturing those into Group 1.
See the sample regex for this:
\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)
Since it is unwieldy, you should think of building it from blocks, dynamically:
import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
matches = list(filter(None, re.findall(rx, s)))  # list() is needed in Python 3
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']
So,
\b - matches a word boundary
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
(?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
\s+ - 1+ whitespaces
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
\b - trailing word boundary
\.? - an optional .
Then, this block is used to build
{0}\s+[sS]egment\b - the block we defined before followed with
\s+ - 1+ whitespaces
[sS]egment\b - either segment or Segment whole words
| - or
({0}) - Group 1 (what re.findall actually returns): the block we defined before.
filter(None, ...) drops the empty strings that re.findall returns for the discarded segment matches. In Python 2.x it already returns a list; in Python 3.x wrap the call as list(filter(None, re.findall(rx, s))).
