How to extract text at newline using regex in Python?

I am having trouble extracting the text/values that follow a newline using regex.
I'm trying to get the values under "REQUIRED QUALIFICATIONS:".
If I use:
pattern = re.compile(r"JOB RESPONSIBILITIES: .*")
matches = pattern.finditer(gh)
The output is:
<_sre.SRE_Match object; span=(161, 227), match='JOB DESCRIPTION:
Public outreach and strengthen>
But if I use:
pattern = re.compile(r"REQUIRED QUALIFICATIONS: .*")
I get:
match='REQUIRED QUALIFICATIONS: \r'>
Here is the text I'm trying to extract from:
JOB RESPONSIBILITIES: \r\n- Working with the Country Director to
provide environmental information\r\nto the general public via regular
electronic communications and serving\r\nas the primary local contact
to Armenian NGOs and businesses and the\r\nArmenian offices of
international organizations and agencies;\r\n- Helping to organize and
prepare CENN seminars/ workshops;\r\n- Participating in defining the
strategy and policy of CENN in Armenia,\r\nthe Caucasus region and
abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally
related field, or 5 years relevant\r\nexperience;\r\n- Oral and
written fluency in Armenian, Russian and English;\r\n- Knowledge/
experience of working with environmental issues specific to\r\nArmenia
is a plus.\r\nREMUNERATION:
How do I solve this problem? Thanks in advance.

You can use a positive lookbehind: (?<=REQUIRED QUALIFICATIONS:)
Code:
import re
text = """
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
REQUIRED QUALIFICATIONS:
- Degree in environmentally related field, or 5 years relevant
experience;
- Oral and written fluency in Armenian, Russian and English;
- Knowledge/ experience of working with environmental issues specific to
Armenia is a plus.
REMUNERATION:
"""
pattern =r'(?<=REQUIRED QUALIFICATIONS:)(\s.+)?REMUNERATION'
print(re.findall(pattern,text,re.DOTALL))
output:
['\n\n- Degree in environmentally related field, or 5 years relevant\n\nexperience;\n\n- Oral and written fluency in Armenian, Russian and English;\n\n- Knowledge/ experience of working with environmental issues specific to\n\nArmenia is a plus.\n\n']
Regex information:
(?<=REQUIRED QUALIFICATIONS:) is a positive lookbehind. It asserts that the text immediately before the current position is REQUIRED QUALIFICATIONS: (matched literally, case sensitive) without making it part of the match.
(\s.+)? is the first capturing group.
? matches the group between zero and one times, as many times as possible, giving back as needed (greedy).
\s matches any whitespace character (equal to [\r\n\t\f\v ]).
.+ matches any character between one and unlimited times, as many times as possible, giving back as needed (greedy). With re.DOTALL the dot also matches newlines.
REMUNERATION matches the characters REMUNERATION literally and marks where the match ends.
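As a complement to the lookbehind approach, here is a minimal sketch that captures whatever sits between two headings with a lazy group and re.DOTALL, and trims the surrounding \r\n. The helper name extract_section and the shortened sample text are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Shortened stand-in for the job posting, with Windows-style \r\n line endings.
text = (
    "JOB RESPONSIBILITIES: \r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n"
    "REQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years relevant\r\n"
    "experience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n"
    "REMUNERATION:"
)


def extract_section(text, heading, next_heading):
    """Return the text between two headings, or None if they are not found."""
    # re.escape keeps the headings literal; the lazy (.*?) stops at the first
    # occurrence of next_heading; \s* on both sides trims the line breaks.
    pattern = re.escape(heading) + r"\s*(.*?)\s*" + re.escape(next_heading)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


qualifications = extract_section(text, "REQUIRED QUALIFICATIONS:", "REMUNERATION:")
print(qualifications.splitlines())
```

Because the group is lazy, the match ends at the first following heading rather than the last one, which matters if the same heading appears more than once in a longer document.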

You can try this regex, which is the same as yours except that it starts with the inline modifier (?s) (the single-line, or dot-all, modifier). It makes the dot (.) match every character, including the newline character \n, so a multi-line text can be handled like a single-line string.
(?s)JOB RESPONSIBILITIES: .*
I used re.match() and took the full match from group(0), as follows:
import re
ss="""JOB RESPONSIBILITIES: \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years relevant\r\nexperience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n- Knowledge/ experience of working with environmental issues specific to\r\nArmenia is a plus.\r\nREMUNERATION:"""
pattern= re.compile(r"(?s)JOB RESPONSIBILITIES: .*")
print(pattern.match(ss).group(0))
The output is:
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
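REQUIRED QUALIFICATIONS: 
- Degree in environmentally related field, or 5 years relevant
experience;
- Oral and written fluency in Armenian, Russian and English;
- Knowledge/ experience of working with environmental issues specific to
Armenia is a plus.
REMUNERATION: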
Additionally, you can set the dot-all (single-line) modifier through the flags argument of the re module's functions, using re.S, as follows:
pattern= re.compile(r"JOB RESPONSIBILITIES: .*",re.S)
For more information, please refer to re — Regular expression operations
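To see why the original pattern stopped at \r, the following sketch runs the same pattern with and without re.S on a shortened, made-up sample. By default the dot matches every character except \n, so it consumes the \r and then stops.

```python
import re

# Shortened stand-in for the job posting, with Windows-style \r\n line endings.
sample = "REQUIRED QUALIFICATIONS: \r\n- Degree in a related field;\r\nREMUNERATION:"
pattern = r"REQUIRED QUALIFICATIONS: .*"

# Default mode: "." matches anything except "\n", so the match ends after "\r".
without_flag = re.search(pattern, sample).group(0)

# re.S (re.DOTALL): "." also matches "\n", so the greedy ".*" runs to the end.
with_flag = re.search(pattern, sample, re.S).group(0)

print(repr(without_flag))
print(repr(with_flag))
```

Note that with the flag set, the greedy .* keeps going to the end of the string, so a lazy quantifier or an explicit end marker is still needed to stop at the next heading.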

Related

Python Regular Expression: re.sub to replace matches

I am trying to analyze an earnings call using Python regular expressions.
I want to delete the unnecessary lines that contain only the name and position of the person who is speaking next.
This is an excerpt of the text I want to analyze:
"Questions and Answers\nOperator [1]\n\n Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]\n I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.\n Timothy D. Cook, Apple Inc. - CEO & Director [3]\n ..."
At the end of each line that I want to delete, you have [some number].
So I used the following line of code to get these lines:
name_lines = re.findall('.*[\d]]', text)
This works and gives me the following list:
['Operator [1]',
' Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]',
' Timothy D. Cook, Apple Inc. - CEO & Director [3]']
So, in the next step I want to remove these strings from the text using the following code:
for i in range(0, len(name_lines)):
    text = re.sub(name_lines[i], '', text)
But this does not work. Replacing a single string without the loop does not work either, and I have no clue why.
Also, if I use re.findall to search for the lines I obtained from the first line of code, I don't get a match.
Use re.sub with the pattern itself to remove the matches:
import re
text = """\
Questions and Answers
Operator [1]
Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]
I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.
Timothy D. Cook, Apple Inc. - CEO & Director [3]"""
text = re.sub(r".*\d]", "", text)
print(text)
Prints:
Questions and Answers
I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.
The first argument to re.sub is treated as a regular expression, so the square brackets get a special meaning and don't match literally.
You don't need a regular expression for this replacement at all though (and you also don't need the loop counter i):
for name_line in name_lines:
    text = text.replace(name_line, '')
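If a regular expression is still preferred for the replacement, the stored lines can be passed through re.escape first so that [ and ] are matched literally. A small sketch with a shortened, made-up transcript:

```python
import re

text = "Operator [1]\nI hope everyone is well.\nTimothy D. Cook, Apple Inc. - CEO & Director [3]\n"
name_line = "Operator [1]"

# Unescaped, "[1]" is a character class matching the single character "1",
# so the pattern looks for "Operator 1" and finds nothing to replace.
unescaped = re.sub(name_line, "", text)

# re.escape turns the string into a pattern that matches it literally.
escaped = re.sub(re.escape(name_line), "", text)

print(unescaped == text)
print(escaped == text.replace(name_line, ""))
```

For plain literal replacements like this one, str.replace remains the simpler choice; re.escape is mainly useful when the literal has to be combined with other regex syntax.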

How to find a specific, pre-defined word surrounded by any word(s) starting with a capital letter(s)?

I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
You can use 2 capture groups instead, and match a single word starting with a capital A-Z on the left or on the right.
Using [^\S\r\n] will match a whitespace char without a newline, as \s can match a newline
\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*
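A short sketch of how this pattern might be used from Python, with made-up sample lines. Note that it keeps only one capitalized word next to the keyword, so longer phrases are shortened to the nearest neighbour.

```python
import re

pattern = re.compile(
    r"\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*"
)

lines = [
    "Europe National Longitudinal Study",
    "Study Initiative",
    "a study of several tests",  # lowercase keyword, no capitalized neighbour
]

# group(0) is the whole match; groups 1 and 2 hold the keyword, depending on
# which side of the alternation matched.
found = [m.group(0) for line in lines for m in pattern.finditer(line)]
print(found)
```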
Ok, this is possibly way out of the actual scope but you could use the newer regex module with subroutines:
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)
See a demo on regex101.com (and mind the modifiers!).
In actual code, this could be:
import regex as re
junk = """
I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
"""
pattern = re.compile(r'''
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)''', re.VERBOSE)
for match in pattern.finditer(junk):
    print(match.group(0))
And would yield
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})
Test
I still have to test it further to check whether it works for all of the possible scenarios in the real world. I might need to adjust the 5 in the expression to a lower or higher number to optimize my algorithm's performance, though. I have already tested it on some real datasets, and the results have been promising so far. It is fast.
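A quick way to try the bounded pattern above from Python, using a made-up sentence. Two things are worth checking on real data: the match can carry trailing whitespace (stripped here), and because both quantifiers allow zero repetitions, a lone "Study" with no capitalized neighbours also matches.

```python
import re

pattern = re.compile(r"((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})")

text = "they joined the Europe National Longitudinal Study Initiative last year."

# strip() removes the whitespace that the trailing \s* may have consumed.
phrases = [m.group(0).strip() for m in pattern.finditer(text)]
print(phrases)

# {0,5} on both sides means the keyword alone is accepted as well.
lone = pattern.search("a lone Study here").group(0).strip()
print(lone)
```

If a bare keyword should be rejected, one option is to filter the results afterwards and keep only matches that contain more than one word.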

Regex: marking a pattern

I'm trying to mark a sentence that contains "manu", from the nearest \n\n before it to the nearest \n\n after it.
This is the text:
\n\nHolds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.\n\n/ tou\n\nFor and on behalf\n\n
I want to mark just this:
the following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
I tried this regex:
\\n\\n(.+manu.+?)\\n\\n
but it ignores the \n\n nearest to my pattern and marks much more text than I want:
Holds Certificate No: EMS 96453\nand operates an Environmental Management System which complies with the requirements of ISO for\n\nthe following scope:The Environmental Management System of Dow Corning, for management of environmental\nrisks associated with all global business processes for the marketing, developing,\n manufacturing, and supply of silicon-based and complementary products and services.
What am I missing?
The pattern starts at the left by matching \\n\\n, and the dot that follows matches any character, including the characters of a later \\n\\n. So .+ can run across separators on its way to manu, and the match begins at the first \\n\\n in the text rather than at the one nearest to manu.
You can use a pattern to match \\n\\n and make sure to not match it again before encountering manu
Then match until the first occurrence of \\n\\n after it, and capture the part that you want in a capture group.
\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n
Explanation
\\n\\n Match literally
( Capture group 1
(?:(?!\\n\\n).)+ Match any char one or more times, each time asserting that \\n\\n does not start at the current position
manu.+? Match manu followed by as few chars as possible
) Close group 1
\\n\\n Match literally
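The difference between the two patterns can be checked with a small Python sketch. The sample is a shortened, made-up version of the certificate text, written as a raw string so that each separator is the two characters backslash + n, as the question's regex assumes.

```python
import re

# Raw string: "\n" below is a literal backslash followed by "n", not a newline.
text = (
    r"\n\nHolds Certificate No: EMS 96453\n\n"
    r"the following scope: marketing, developing,\n manufacturing, and supply of products."
    r"\n\n/ tou\n\n"
)

# Original attempt: ".+" is free to cross the second separator.
naive = re.search(r"\\n\\n(.+manu.+?)\\n\\n", text).group(1)

# Tempered dot: each character is consumed only if no separator starts there.
tempered = re.search(r"\\n\\n((?:(?!\\n\\n).)+manu.+?)\\n\\n", text).group(1)

print(naive)
print(tempered)
```

If the real input contains actual newline characters instead of literal backslash-n pairs, the pattern would use \n\n and the re.DOTALL flag instead; that variant is an assumption about the data, not something stated in the question.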
If you also want the match when it is either followed by \\n\\n or the end of the string:
\\n\\n((?:(?!\\n\\n).)+manu.+?)(?:\\n\\n|$)

python regex sub not working due to escape sequence

I have this kind of text. -> Roberto is an insurance agent who sells two types of policies: a $$\$$50,000$$ policy and a $$\$$100,000$$ policy. Last month, his goal was to sell at least 57 insurance policies. While he did not meet his goal, the total value of the policies he sold was over $$\$$3,000,000$$. Which of the following systems of inequalities describes $$x$$, the possible number of $$\$$50,000$$ policies, and $$y$$, the possible number of $$\$$100,000$$ policies, that Roberto sold last month?
I want to remove expressions containing dollar signs, such as $$\$$50,000$$. Removing things such as $$y$$ worked out quite well, but the expressions that contain an escape sequence don't get removed.
This is the code I used.
re.sub("$$\$$.*?$$", "", text)
This didn't work, and I found out that \ is an escape character, so it should be written as \\. So I replaced the expression as below.
re.sub("$$\\$$.*?$$", "", text)
However, this didn't work either. What am I doing wrong? Thanks a lot in advance.
The character $ is a regex metacharacter, so it needs to be escaped when it is meant to match a literal $:
import re
text = """Roberto is an insurance agent who sells two types of policies: a $$\$$50,000$$ policy and a $$\$$100,000$$ policy. Last month, his goal was to sell at least 57 insurance policies. While he did not meet his goal, the total value of the policies he sold was over $$\$$3,000,000$$. Which of the following systems of inequalities describes $$x$$, the possible number of $$\$$50,000$$ policies, and $$y$$, the possible number of $$\$$100,000$$ policies, that Roberto sold last month?"""
output = re.sub(r'\$\$(?:\\\$\$)?.*?\$\$', '', text)
print(output)
The above pattern makes the \$$ optional, to cover all cases.
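To make the effect of the escaping visible, this sketch runs an unescaped and an escaped pattern on a shortened, made-up sentence. Unescaped, every $ is an end-of-string anchor, so the pattern only finds an empty match at the end and the text comes back unchanged.

```python
import re

# Raw string so that the backslash in the sample stays a literal backslash.
text = r"a $$\$$50,000$$ policy for $$x$$ units"

# Unescaped: "$" means "end of string", not a dollar sign.
unchanged = re.sub(r"$$.*?$$", "", text)

# Escaped: "\$" matches a literal dollar sign; the optional group covers the
# "\$$" prefix that the currency expressions carry.
cleaned = re.sub(r"\$\$(?:\\\$\$)?.*?\$\$", "", text)

print(unchanged == text)
print(cleaned)
```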

python regex negative lookahead method

I'm extracting firm names from text data (10-K statements).
I first tried the NLTK StanfordTagger and extracted all the words tagged as organizations. However, it quite often failed to recall all the firm names, and since I apply the tagger to every related sentence, it took a very long time.
So I'm trying to extract all the words that start with a capital letter (or that consist entirely of capital letters).
I found the regex below helpful.
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+
However, it cannot distinguish the name of a segment from the name of a firm.
For example,
sentence :
The Company's customers include, among others, Conner Peripherals Inc.("Conner"),
Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.
I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, Applieds, but not 'Silicon Systems', since it is the name of a segment.
So, I tried using
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)
However, it still extracts 'Silicon Systems'.
Could you help me solve this problem?
(Or do you have any idea how to extract only the firm names from the text data?)
Thanks a lot!!!
You need to capture each run of consecutive capitalized words as one group, and mark the per-word group as non-capturing with (?:) so that the outer group captures the whole run:
>>> re.findall("((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']
The NLTK approach, or any machine learning, seems to be a better approach here. I can only explain what the difficulty and current issue with the regex approach are.
The problem is that the matches expected can contain space separated phrases, and you want to avoid matching a certain phrase ending with segment. Even if you correct the negative lookahead as (?!\s*[Ss]egment), and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match.
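The backtracking behaviour described above can be reproduced with a short sketch on a made-up sentence: the lookahead rejects "Silicon Systems", so the engine gives back the second word and settles for "Silicon".

```python
import re

entity = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
with_lookahead = entity + r"(?!\s+[sS]egment)"

s = "sales in the Silicon Systems segment to Maxtor Corporation."

# "Silicon Systems" fails the lookahead, but the shorter "Silicon" passes it,
# because what follows "Silicon" is " Systems", not " segment".
partial = re.findall(with_lookahead, s)
print(partial)
```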
What you might try to do is to match all these entities and discard after matching, and only keep those entities in other contexts by capturing them into Group 1.
See the sample regex for this:
\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)
Since it is unwieldy, you should think of building it from blocks, dynamically:
import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
matches = list(filter(None, re.findall(rx, s)))  # list() so Python 3 prints the items, not a filter object
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']
So,
\b - matches a word boundary
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
(?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
\s+ - 1+ whitespaces
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
\b - trailing word boundary
\.? - an optional .
Then, this block is used to build
{0}\s+[sS]egment\b - the block we defined before followed with
\s+ - 1+ whitespaces
[sS]egment\b - either segment or Segment whole words
| - or
({0}) - Group 1 (what re.findall actually returns): the block we defined before.
filter(None, ...) drops the empty strings from the final list. In Python 2.x it returns a list directly; in Python 3.x it returns an iterator, so use list(filter(None, re.findall(rx, s))).
